Checking causalRisk estimators against standard R functions

Overview

To assess the validity of the estimators in the causalRisk package, we compare them to a variety of widely-used estimators from standard R packages. Note that because most of the functionality in the causalRisk package is not implemented in existing packages, we can only assess the validity of some of the functions.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and no treatment.

Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk of mortality by 20 months:

test_data = example1
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)
test_data$D = ifelse(!is.na(test_data$Death), 1, 0)
standard = survfit(Surv(Ymin,D)~1, data = test_data)
100*round(1-summary(standard, times=c(20))$surv, 5)

## [1] 45.886

Using the causalRisk package, we should get an estimated cumulative risk with similar reported precision.

models = specify_models(identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(example1, models,
                 label = c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment.

Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk at 20 months as:

standard = survfit(Surv(Ymin,D) ~ Statin, data = test_data)
100*round(1 - summary(standard, times = c(20))$surv, 5)

## [1] 34.879 49.293

Using causalRisk we should get the same estimated cumulative risk within reported precision:

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(example1, models,
                 label=c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round =3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment and maximum follow-up set.

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(example1, models, tau = 21,
                 label=c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 3-level treatment.

Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk at 20 months:

standard = survfit(Surv(Ymin, D) ~ StatinPotency, data = test_data)
100*round(1 - summary(standard, times = c(20))$surv, 5)

## [1] 34.879 49.670 48.927

Using causalRisk we should get the same estimated cumulative risk within reported precision.

models = specify_models(identify_treatment(StatinPotency),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj3 = estimate_ipwrisk(example1, models,
                 label = c("Kaplan-Meier check"))

make_table2(obj3, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment.

We assess the performance of causalRisk and an IP weighted Kaplan-Meier function in the survival package in a setting where they should both give consistent estimates (a null effect) in a simulated example with confounding but no non-informative censoring.

test_data = sim_func1(n=100000, eta1 = 0, eta3 = 0, seed = 123)
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)
test_data$D = ifelse(!is.na(test_data$Death), 1, 0)

x = model.matrix(~DxRisk, data = test_data)
psout = glmnet::glmnet(x, test_data$Statin,
                   family="multinomial", 
                   type.multinomial="grouped", lambda = 0)
ps = predict(psout, newx = x, type = "response", s = 0)[, 2,1]
iptw = (test_data$Statin=="Treat") / ps + (test_data$Statin=="Control") / (1-ps)
for(i in 1:2)
  iptw = 
    ifelse(test_data$Statin == levels(test_data$Statin)[i],
           iptw * length(iptw) / 
             sum(iptw[test_data$Statin == test_data$Statin[i]]), iptw)
standard = survfit(Surv(Ymin, D) ~ 1 + strata(Statin), data = test_data, 
                   type="kaplan-meier", weight = iptw)
100 * round(1 - summary(standard, times = c(20))$surv, 5)

## [1] 58.229 58.358

Using causalRisk we should get a result that is very close and also at the null.

models = specify_models(identify_treatment(Statin, ~DxRisk),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(test_data, models,
                 times = c(20), 
                 wt_type = "n",
                 label=c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round = 3)

Checking `causalRisk` with a 3-level treatment and confounding.

Using the same simulated data, we assess the performance of causalRisk when treatment has three levels.

Using causalRisk we should get a result that is also at the null.

models = specify_models(identify_treatment(StatinPotency, ~DxRisk),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(test_data, models,
                 times=c(20), 
                 label=c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment.

Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk at 20 months:

test_data = example1
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      test_data$Discon, 
                      na.rm = TRUE)

test_data$D = ifelse(test_data$Ymin == test_data$Death, 1, 0)
test_data$D = ifelse(is.na(test_data$D),0,test_data$D)
standard = survfit(Surv(Ymin,D)~Statin, data = test_data)
100*round(1-summary(standard, times=c(20))$surv, 5)

## [1] 35.076 47.469

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_censoring(Discon),
                        identify_outcome(Death))

obj5 = estimate_ipwrisk(example1, models, 
                 label=c("Kaplan-Meier check"))

make_table2(obj5, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment when some event times are not observed.

Here the expected behavior is for subjects with all missing outcomes and censoring times to be censored at \(\tau\). Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk at 20 months:

tau = 10
test_data = example1
test_data$Ymin = pmin(test_data$Death,
                      test_data$Discon, 
                      tau,
                      na.rm = TRUE)

test_data$D = ifelse(test_data$Ymin == test_data$Death, 1, 0)
test_data$D = ifelse(is.na(test_data$D),0,test_data$D)
standard = survfit(Surv(Ymin,D)~Statin, data = test_data)
100*round(1-summary(standard, times=c(tau))$surv, 5)

## [1]  7.686 15.587

Using causalRisk we should get the same estimated cumulative risk within reported precision.

models = specify_models(identify_treatment(Statin),
                        identify_censoring(Discon),
                        identify_outcome(Death))

obj5 = estimate_ipwrisk(example1, models, tau = tau,
                 label=c("Kaplan-Meier check"))

make_table2(obj5, risk_time = 10, risk_round = 3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment when some event times are not observed and \(\tau\) is not specified

Here the expected behavior is for \(\tau\) to be set to the maximum observed failure or censoring times and subjects with all missing outcomes and censoring times to be censored at \(\tau\). Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk at 20 months:

tau = max(test_data$Death,
                      test_data$Discon, 
                      na.rm = TRUE)
test_data = example1
test_data$Ymin = pmin(test_data$Death,
                      test_data$Discon, 
                      tau,
                      na.rm = TRUE)

test_data$D = ifelse(test_data$Ymin == test_data$Death, 1, 0)
test_data$D = ifelse(is.na(test_data$D),0,test_data$D)
standard = survfit(Surv(Ymin,D)~Statin, data = test_data)
100*round(1-summary(standard, times=c(10))$surv, 5)

## [1]  7.686 15.587

Using causalRisk we should get the same estimated cumulative risk within reported precision.

models = specify_models(identify_treatment(Statin),
                        identify_censoring(Discon),
                        identify_outcome(Death))

obj5 = estimate_ipwrisk(example1, models, 
                 label=c("Kaplan-Meier check"))

make_table2(obj5, risk_time = 10, risk_round = 3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator when there are tied event times.

We take the example data and round the death and censoring times to one decimal place to create many tied event and censoring times. We then use the Kaplan-Meier function in the survival pacakge to estimate the cumulative risk at 20 months:

test_data = example1
test_data$Death = round(test_data$Death,1)
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)

test_data$D = ifelse(!is.na(test_data$Death), 1, 0)
standard = survfit(Surv(Ymin, D) ~ Statin, data = test_data)
100*round(1-summary(standard, times = c(20))$surv, 5)

## [1] 35.006 49.361

Using causalRisk we estimate the same quantity and should get the same estimate to the reported precision.

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj6 = estimate_ipwrisk(test_data, models,
                 label=c("Kaplan-Meier check"))

make_table2(obj6, risk_time = 20, risk_round =3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator when there are tied event and censoring times.

We take the example data and round the death times to one decimal place to create many tied event times. We then use the Kaplan-Meier function in the survival pacakge to estimate the cumulative risk at 20 months:

temp_data3 = example1
temp_data3$Death = round(temp_data3$Death, 1)
temp_data3$EndofEnrollment = round(temp_data3$EndofEnrollment, 1)
temp_data3$Ymin = pmin(temp_data3$Death,
                      temp_data3$EndofEnrollment, 
                      na.rm = TRUE)

temp_data3$D = ifelse(!is.na(temp_data3$Death), 1, 0)
standard = survfit(Surv(Ymin,D)~Statin, data = temp_data3, type = "fh2")
100*round(1-summary(standard, times=c(20))$surv, 5)

## [1] 34.901 49.294

Using causalRisk we get estimates that deviate only slightly:

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj6 = estimate_ipwrisk(temp_data3, models,
                 label=c("Kaplan-Meier check"))

make_table2(obj6, risk_time = 20, risk_round =3)

Confirm that `causalRisk` correctly handles formulae for composites censoring variables

The expected behavior is for causalRisk to take the formula associated with the first censoring event that is specified. Therefore the following analyses should return the same results.

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment, ~DxRisk),
                        identify_censoring(Discon, ~Frailty),
                        identify_outcome(Death))

obj1 = estimate_ipwrisk(example1, models,
                        label=c("1"))

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment, ~DxRisk),
                        identify_censoring(Discon),
                        identify_outcome(Death))

obj2 = estimate_ipwrisk(example1, models,
                       label=c("2"))

make_table2(obj1, obj2, risk_time = 20, risk_round =3)

Checking whether `causalRisk` agrees with a non-parameteric cumulative risk estimator in the presence of a competing event.

Using the non-parametric cumulative incidence function estimator in the cmprsk package, we estimate the cumulative risk of CVD-related mortality at 20 months:

test_data = example1
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)
test_data$D = ifelse(!is.na(test_data$Death), 1, 0)
test_data$D = ifelse(test_data$CvdDeath == 1 & test_data$D == 1, 2, test_data$D)
cr_out = cuminc(test_data$Ymin, test_data$D, cencode=0)
100*round(cmprsk::timepoints(cr_out, 20)$est[2], 5)

## [1] 22.684

Using causalRisk we should get the same estimated cumulative risk to the reported precision.

models = specify_models(
                        identify_censoring(EndofEnrollment),
                        identify_competing_risk(CvdDeath, event_value = 1),
                        identify_outcome(Death))

cr_obj = estimate_ipwrisk(example1, models,
                 label=c("Aalen-Johansen check"))

make_table2(cr_obj, risk_time = 20, risk_round =3)

Confirming code returns same results as non-parameteric cumulative risk estimator in the presence of a competing event across two treatment groups.

Using the non-parametric cumulative incidence function estimator in the cmprsk package, we estimate the cumulative risk of CVD-related mortality at 20 months in each treatment group in the example data:

cr_out = cuminc(test_data$Ymin, test_data$D, test_data$Statin, cencode = 0)
100*round(cmprsk::timepoints(cr_out, 20)$est[c(3,4)], 5)

## [1] 17.661 24.223

Using causalRisk we should get the same estimated cumulative risk within reported precision.

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_competing_risk(CvdDeath, event_value = 1),
                        identify_outcome(Death))

cr_obj2 = estimate_ipwrisk(example1, models,
                 label=c("Aalen-Johansen check"))

make_table2(cr_obj2, risk_time = 20, risk_round =3)

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment and a subgroup specified.

Using the Kaplan-Meier function in the survival pacakge, we estimate the cumulative risk at 20 months as:

test_data = example1
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)
test_data$D = ifelse(!is.na(test_data$Death), 1, 0)
standard = survfit(Surv(Ymin, D) ~ Statin, data = test_data, subset = DxRisk<1)
100*round(1-summary(standard, times=c(20))$surv, 5)

## [1] 36.673 49.817

Using causalRisk we should get the same estimated cumulative risk within reported precision:

models = specify_models(identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(example1, models,
                 label=c("Kaplan-Meier check"),
                 subgroup = DxRisk < 1)

make_table2(obj, risk_time = 20, risk_round =3)

Checking whether `causalRisk` agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment w/ trimming at the first percentile.

As before, we make censoring non-informative and the treatment weights are normalized in causalRisk:

test_data = sim_func1(n=100000, eta1 = 0, eta3 = 0, seed = 123)
test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)
test_data$D = ifelse(!is.na(test_data$Death), 1, 0)

x = model.matrix(~DxRisk, data = test_data)
psout = glmnet::glmnet(x, test_data$Statin,
                   family="multinomial", 
                   type.multinomial="grouped", lambda = 0)
ps = predict(psout, newx = x, type = "response", s = 0)[, 2,1]
keeprecs = ifelse(ps > quantile(ps, 0.01) & ps < quantile(ps, 0.99), TRUE, FALSE) 
ps = ps[keeprecs]
test_data_trim = test_data[keeprecs,]
table(test_data_trim$Statin)

## 
## Control   Treat 
##   48978   49022

# re-estimate PS
x = model.matrix(~DxRisk, data = test_data_trim)
psout = glmnet::glmnet(x, test_data_trim$Statin,
                   family="multinomial", 
                   type.multinomial="grouped", lambda = 0)
ps = predict(psout, newx = x, type = "response", s = 0)[, 2, 1]

iptw = (test_data_trim$Statin=="Treat") / ps + (test_data_trim$Statin=="Control") / (1-ps)
standard = survfit(Surv(Ymin, D) ~ Statin, 
                   data = test_data_trim, weight = iptw)
100*round(1-summary(standard, times = c(20))$surv, 5)

## [1] 58.214 58.241

Using causalRisk we should get a result that has the same number of patients and the estimators are statistically similar.

models = specify_models(identify_treatment(Statin, ~DxRisk),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(test_data, models,
                 wt_type = "n",
                 trim = 0.01,
                 label=c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment w/ trimming at the first percentile w/ in a subgroup

As before, we make censoring non-informative and the treatment weights are normalized in causalRisk:

test_data = sim_func1(n = 100000, eta1 = 0, eta3 = 0)
test_data = test_data %>%
  filter(DxRisk < 1)

test_data$Ymin = pmin(test_data$Death,
                      test_data$EndofEnrollment, 
                      na.rm = TRUE)

test_data$D = ifelse(!is.na(test_data$Death), 1, 0)
x = model.matrix(~DxRisk, data = test_data)
psout = glmnet::glmnet(x, test_data$Statin,
                   family="multinomial", 
                   type.multinomial="grouped", lambda = 0)
ps = predict(psout, newx = x, type = "response", s = 0)[, 2,1]
keeprecs = ifelse(ps > quantile(ps, 0.01) & ps < quantile(ps, 0.99), TRUE, FALSE) 

test_data_trim = test_data %>%
  filter(keeprecs)
                    
table(test_data_trim$Statin)

## 
## Control   Treat 
##   34973   47461

# re-estimate PS
x = model.matrix(~DxRisk, data = test_data_trim)
psout = glmnet::glmnet(x, test_data_trim$Statin,
                   family="multinomial", 
                   type.multinomial="grouped", lambda = 0)
ps = predict(psout, newx = x, type = "response", s = 0)[, 2,1]

iptw = (test_data_trim$Statin=="Treat") / ps + (test_data_trim$Statin=="Control") / (1-ps)
standard = survfit(Surv(Ymin, D) ~ Statin, 
                   subset = DxRisk < 1,
                   data = test_data_trim, 
                   weight = iptw)
100*round(1-summary(standard, times=c(20))$surv, 5)

## [1] 59.667 59.857

Using causalRisk we should get a result with the same number of patients and results in an estimate that is close to the KM and at the null

models = specify_models(identify_treatment(Statin, ~DxRisk),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

obj = estimate_ipwrisk(test_data, models,
                 wt_type = "n",
                 trim = 0.01,
                 subgroup = DxRisk < 1,
                 label = c("Kaplan-Meier check"))

make_table2(obj, risk_time = 20, risk_round = 3)

Checking `causalRisk` with longitudinal data with time-varying predictors of censoring

Here we use the data set example1_timevar, which involves longitudinal data with a time-varying predictor of censoring. This data set is the same as example1 except that the data are discretized into two intervals and DxRisk is made to be time-varying, by adding a random variate to DxRisk that has a small positive expectation. Adjusting for DxRisk in the propensity score model and Frailty in the censoring model should result in the same estimate as estimate_ipwrisk applied to example1:

tv_models = specify_models(identify_interval(start_name = Start,
                                             end_name = Stop),
                           identify_treatment(Statin, ~DxRisk),
                           identify_censoring(EndofEnrollment, ~Frailty),
                           identify_outcome(Death),
                           identify_subject(id))

adj_tv = estimate_ipwrisk(example1_timevar, 
                                  tv_models,
                                  tau = 100,
                                  label = c("Adjusted-TV"))

make_table2(adj_tv, risk_time = 20, risk_round = 3)

models = specify_models(identify_treatment(Statin, ~DxRisk),
                        identify_censoring(EndofEnrollment, ~Frailty),
                        identify_outcome(Death),
                        identify_subject(id))

adj = estimate_ipwrisk(example1, 
                      models, 
                      tau = 100, 
                      label=c("Adjusted-baseline"))

make_table2(adj, risk_time = 20, risk_round = 3)

If we now adjust for DxRisk in the censoring model in both the time-varying and time-invariant estimator, we expect to that we should get two estimates that are unbiased, but numerically different from the previous estimators as well as each other as the inclusion of both the time-varying and time-invariant DxRisk will introduce some additional variability in the estimated censoring weights.

tv_models = specify_models(identify_interval(start_name = Start,
                                             end_name = Stop),
                           identify_treatment(Statin, ~DxRisk),
                           identify_censoring(EndofEnrollment, ~Frailty + DxRisk),
                           identify_outcome(Death),
                           identify_subject(id))

adj_tv = estimate_ipwrisk(example1_timevar, 
                                  tv_models,
                                  tau = 100, 
                                  label=c("Adjusted-TV"))

make_table2(adj_tv, risk_time = 20, risk_round =3)

models = specify_models(identify_treatment(Statin, ~DxRisk),
                        identify_censoring(EndofEnrollment, ~Frailty + DxRisk),
                        identify_outcome(Death),
                        identify_subject(id))

adj = estimate_ipwrisk(example1, 
                      models, 
                      tau = 100, 
                      label=c("Adjusted-baseline"))

make_table2(adj, risk_time = 20, risk_round = 3)

Checking whether `causalRisk` with data structured longitudinally agrees with equivalent data structured non-longitudinally

Here we use the longitudinal data set example1_timevar, which is equivalent to example1.

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

results1 = estimate_ipwrisk(example1, models, tau = 20, label=c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

results2 = estimate_ipwrisk(example1_timevar, models, tau = 20, label=c("Time-varying"))

make_table2(results1, results2, risk_time = 20)

models = specify_models(identify_subject(id),
                        identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

results1 = estimate_ipwrisk(example1, models, tau = 20, label=c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_treatment(Statin),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

results2 = estimate_ipwrisk(example1_timevar, models, tau = 20,label=c("Time-varying"))

make_table2(results1, results2, risk_time = 20)

models = specify_models(identify_subject(id),
                        identify_treatment(Statin),
                        identify_censoring(EndofEnrollment, ~Frailty),
                        identify_outcome(Death))

results1 = estimate_ipwrisk(example1, models, tau = 20, label = c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_treatment(Statin),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment, ~Frailty),
                        identify_outcome(Death))

results2 = estimate_ipwrisk(example1_timevar, models, tau = 20, label = c("Time-varying"))

make_table2(results1, results2, risk_time = 20)

Checking whether `causalRisk` with data structured longitudinally agrees with equivalent data structured non-longitudinally when there are two sources or right censoring.

Here we use the longitudinal data set example1_timevar, which is equivalent to example1.

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_censoring(Discon),
                        identify_outcome(Death))

results1 = estimate_ipwrisk(example1, models, tau = 20, label=c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_censoring(Discon),
                        identify_outcome(Death))

results2 = estimate_ipwrisk(example1_timevar, models, tau = 20, label=c("Time-varying"))

make_table2(results1, results2, risk_time = 20)

Checking whether a subgroup result defined in longitudinal data agrees with an equivalent subgroup defined in time-fixed data

Here again we compare results from the longitudinal data set example1_timevar, which is equivalent to example1.

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

results1 = estimate_ipwrisk(example1, models, subgroup = Frailty < 0,
                            tau = 20, label=c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

results2 = estimate_ipwrisk(example1_timevar, models, subgroup = Frailty < 0,
                            tau = 20, label=c("Time-varying"))

make_table2(results1, results2, risk_time = 20)

Checking observation weights

When supplied by a user observation weights should be normalized to sum to \(n\). Therefore, if any constant is used as an observation weight, the weighted analysis should be identical to an unweighted analysis.

temp = example1
temp$obs_weight = rep(10,nrow(temp))

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

unweighted = estimate_ipwrisk(temp,
                       models,
                       labels = c("Unweighted"))

weighted = estimate_ipwrisk(temp,
                       models,
                       labels = c("Weighted"),
                       wt = obs_weight)

make_table2(unweighted, weighted, risk_time = 20)

The same should be true within a subgroup:

temp = example1
temp$obs_weight = rep(10,nrow(temp))

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

unweighted = estimate_ipwrisk(temp,
                       models,
                       subgroup = Frailty < 0,
                       labels = c("Unweighted"))

weighted = estimate_ipwrisk(temp,
                       models,
                       subgroup = Frailty < 0,
                       labels = c("Weighted"),
                       wt = obs_weight)

make_table2(unweighted, weighted, risk_time = 20)

Also, if random uniform weights are supplied as observation weights, the weighted analysis should result in an estimator that has the same expectation and variance.

temp = example1
temp$obs_weight = runif(10000)

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

unweighted = estimate_ipwrisk(temp,
                       models,
                       labels = c("Unweighted"))

weighted = estimate_ipwrisk(temp,
                       models,
                       labels = c("Weighted"),
                       wt = obs_weight)

make_table2(unweighted, weighted, risk_time = 20)

When applied to longitudinal data, weights should be normalized to sum to \(n\), not the number of rows in a longitudinal dataset.

temp = example1
temp$obs_weight = runif(10000)

temp_timevar = temp %>% dplyr::select(id, obs_weight) %>%
  dplyr::right_join(example1_timevar)

## Joining with `by = join_by(id)`

models = specify_models(identify_subject(id),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

weighted = estimate_ipwrisk(temp,
                       models,
                       tau = 30,
                       labels = c("Weighted, time-fixed"),
                       wt = obs_weight)

models = specify_models(identify_subject(id),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

weighted_long = estimate_ipwrisk(temp_timevar,
                       models,
                       tau = 30,
                       labels = c("Weighted, time-varying"),
                       wt = obs_weight)

make_table2(weighted, weighted_long, risk_time = 20)

AIPW checks

Though there are very few existing implementations of augmented inverse probability weighted (AIPW) estimators to compare to, the authors of the reference we used to develop our efficient AIPW estimator have an R package, riskRegression, that we can benchmark against. Here, we provide several comparisons between our results and the riskRegression package.

To begin, we compare the results between causalRisk and riskRegression when null models are used for all of the nuisance models. To improve runtime, we will use a sample of the example1 dataset included in causalRisk for comparison. As can be seen below, the results are extremely similar between the two packages.

# Sample Data
set.seed(7368)
dat <- example1 %>% sample_frac(.1)

# Conduct analysis with causalRisk
models_null = specify_models(identify_treatment(Statin),
                             identify_censoring(EndofEnrollment),
                             identify_outcome(Death))

fit_aipw_null = estimate_aipwrisk(dat, 
                             models_null,
                             fully_efficient = TRUE,
                             times = 24,
                             tau_censor = TRUE,
                             labels = c("Null models"))

make_table2(fit_aipw_null)

# Conduct analysis with riskRegression
dat2 <- dat %>%
  mutate(t = pmin(Death, EndofEnrollment, 24, na.rm = TRUE),
         outcome = ifelse(is.na(Death), 0, t == Death),
         censor = ifelse(is.na(EndofEnrollment), 0, t == EndofEnrollment))
 
censmod <- survival::coxph(Surv(t, outcome==0) ~ strata(Statin), data = dat2, x = TRUE)
outmod <- survival::coxph(Surv(t, outcome) ~ strata(Statin), data = dat2, x = TRUE)
psmod <- glm(Statin ~ 1, family = binomial(link = "logit"), data =dat2)

result <- ate(event = outmod,
               treatment = psmod,
               censor = censmod,
               estimator = "AIPTW",
               data = dat2,
               times = 24,
               cause = 1,
               verbose = FALSE)

result$diffRisk %>% kable()

estimator	time	A	estimate.A	B	estimate.B	estimate	se	lower	upper	p.value
AIPTW	24	Control	0.4285414	Treat	0.555398	0.1268566	0.0563232	0.0164652	0.237248	0.0243034

Unfortunately, it is impossible to specify the models for riskRegression the same way as in causalRisk. In causalRisk, the censoring and outcome models are fit seperately in each treatment group. In riskRegression, treatment is included as a covariate in the models. The closest we can get is to stratify the models by treatment for riskRegression. With that, the results are still very similar to causalRisk.

# Conduct analysis with causalRisk
models_adj = specify_models(identify_treatment(Statin, ~DxRisk),
                             identify_censoring(EndofEnrollment, ~DxRisk),
                             identify_outcome(Death, ~DxRisk))

fit_aipw_adj = estimate_aipwrisk(dat, 
                             models_adj,
                             fully_efficient = TRUE,
                             times = 24,
                             tau_censor = TRUE,
                             labels = c("Adjusted models"))

make_table2(fit_aipw_adj)

# Conduct analysis with riskRegression
censmod <- survival::coxph(Surv(t, outcome==0) ~ DxRisk + strata(Statin), data = dat2, x = TRUE)
outmod <- survival::coxph(Surv(t, outcome) ~ DxRisk + strata(Statin), data = dat2, x = TRUE)
psmod <- glm(Statin ~ DxRisk, family = binomial(link = "logit"), data =dat2)

result2 <- ate(event = outmod,
               treatment = psmod,
               censor = censmod,
               estimator = "AIPTW",
               data = dat2,
               times = 24,
               cause = 1,
               verbose = FALSE)

result2$diffRisk %>% kable()

estimator	time	A	estimate.A	B	estimate.B	estimate	se	lower	upper	p.value
AIPTW	24	Control	0.529929	Treat	0.5426873	0.0127583	0.0837316	-0.1513527	0.1768693	0.8788941

Again, there is a difference in how riskRegression and causalRisk handle competing risks. riskRegression uses a multistate Cox model, while causalRisk uses a Fine and Gray model. Nonetheless, the results remain similar.

# Conduct analysis with causalRisk
models_cr = specify_models(identify_treatment(Statin, ~DxRisk),
                           identify_censoring(EndofEnrollment, ~DxRisk),
                           identify_outcome(Death, ~DxRisk),
                           identify_competing_risk(CvdDeath))

fit_aipw_cr = estimate_aipwrisk(dat, 
                             models_cr,
                             fully_efficient = TRUE,
                             times = 24,
                             tau_censor = TRUE,
                             labels = c("Adjusted models"))

make_table2(fit_aipw_cr)

# Setup data
dat3 <- dat2 %>%
  mutate(outcome = ifelse(outcome == 0, 0,
                          ifelse(CvdDeath == 0, 1, 2)),
         outcome = factor(outcome))

# Conduct analysis with riskRegression
censmod <- survival::coxph(Surv(t, outcome == 0) ~ DxRisk + strata(Statin), data = dat3, x = TRUE, y = TRUE)
outmod <- riskRegression::CSC(Hist(t, outcome) ~ DxRisk + strata(Statin), data = dat3)
psmod <- glm(Statin ~ DxRisk, family = binomial(link = "logit"), data =dat3)

result3 <- ate(event = outmod,
               treatment = psmod,
               censor = censmod,
               estimator = "AIPTW",
               data = dat3,
               times = 24,
               cause = 2,
               verbose = FALSE)

result3$diffRisk %>% kable()

estimator	time	A	estimate.A	B	estimate.B	estimate	se	lower	upper	p.value
AIPTW	24	Control	0.4098683	Treat	0.2475148	-0.1623535	0.0905999	-0.3399261	0.0152191	0.0731358

Table 1 checks

make_table1 applied to the same models fit to example1 and example1_timevar should result in identical tables.

models = specify_models(identify_subject(id),
                        identify_treatment(Statin),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

time_fixed = estimate_ipwrisk(example1,
                            models,
                            tau = 30,
                            labels = c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_treatment(Statin),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

time_varying = estimate_ipwrisk(example1_timevar,
                                 models,
                                 tau = 30,
                                 labels = c("Time-varying"))

make_table1(time_fixed, DxRisk, Sex, smd = TRUE)

Characteristic¹	Control n=10000.00	Treat n=10000.00	SMD
DxRisk, Mean (SD)	0.74 (0.85)	-0.35 (0.88)	1.26010726497173
Sex			0.0209159130770583
Female	4939 49.39	4835 48.35
Male	5061 50.61	5165 51.65
¹ All values are N (%) unless otherwise specified

make_table1(time_varying, DxRisk, Sex, smd = TRUE)

Characteristic¹	Control n=10000.00	Treat n=10000.00	SMD
DxRisk, Mean (SD)	0.74 (0.85)	-0.35 (0.88)	1.26010726497173
Sex			0.0209159130770583
Female	4939 49.39	4835 48.35
Male	5061 50.61	5165 51.65
¹ All values are N (%) unless otherwise specified

models = specify_models(identify_subject(id),
                        identify_treatment(Statin, ~DxRisk),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

time_fixed = estimate_ipwrisk(example1,
                            models,
                            tau = 30,
                            labels = c("Time-fixed"))

models = specify_models(identify_subject(id),
                        identify_treatment(Statin, ~DxRisk),
                        identify_interval(Start, Stop),
                        identify_censoring(EndofEnrollment),
                        identify_outcome(Death))

time_varying = estimate_ipwrisk(example1_timevar,
                                 models,
                                 tau = 30,
                                 labels = c("Time-varying"))

make_table1(time_fixed, DxRisk, Sex, smd = TRUE)

Characteristic¹	Control n=9953.53	Treat n=9908.35	SMD
DxRisk, Mean (SD)	0.02 (0.97)	-0.03 (0.97)	0.0586710589322605
Sex			0.00471104495627606
Female	4815 48.37	4769 48.13
Male	5139 51.63	5139 51.87
¹ All values are N (%) unless otherwise specified

make_table1(time_varying, DxRisk, Sex, smd = TRUE)

Characteristic¹	Control n=9953.53	Treat n=9908.35	SMD
DxRisk, Mean (SD)	0.02 (0.97)	-0.03 (0.97)	0.0586710589322605
Sex			0.00471104495627606
Female	4815 48.37	4769 48.13
Male	5139 51.63	5139 51.87
¹ All values are N (%) unless otherwise specified

Overview

Checking whether causalRisk agrees with a Kaplan-Meier estimator with a single source of right censoring and no treatment.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment and maximum follow-up set.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with a single source of right censoring and a 3-level treatment.

Checking whether causalRisk agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment.

Checking causalRisk with a 3-level treatment and confounding.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment when some event times are not observed.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment when some event times are not observed and \(\tau\) is not specified

Checking whether causalRisk agrees with a Kaplan-Meier estimator when there are tied event times.

Checking whether causalRisk agrees with a Kaplan-Meier estimator when there are tied event and censoring times.

Confirm that causalRisk correctly handles formulae for composites censoring variables

Checking whether causalRisk agrees with a non-parameteric cumulative risk estimator in the presence of a competing event.

Confirming code returns same results as non-parameteric cumulative risk estimator in the presence of a competing event across two treatment groups.

Checking whether causalRisk agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment and a subgroup specified.

Checking whether causalRisk agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment w/ trimming at the first percentile.

Checking whether causalRisk agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment w/ trimming at the first percentile w/ in a subgroup

Checking causalRisk with longitudinal data with time-varying predictors of censoring

Checking whether causalRisk with data structured longitudinally agrees with equivalent data structured non-longitudinally

Checking whether causalRisk with data structured longitudinally agrees with equivalent data structured non-longitudinally when there are two sources or right censoring.

Checking whether a subgroup result defined in longitudinal data agrees with an equivalent subgroup defined in time-fixed data

Checking observation weights

AIPW checks

Table 1 checks

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and no treatment.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment and maximum follow-up set.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 3-level treatment.

Checking whether `causalRisk` agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment.

Checking `causalRisk` with a 3-level treatment and confounding.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment when some event times are not observed.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with two sources of right censoring and a 2-level treatment when some event times are not observed and \(\tau\) is not specified

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator when there are tied event times.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator when there are tied event and censoring times.

Confirm that `causalRisk` correctly handles formulae for composites censoring variables

Checking whether `causalRisk` agrees with a non-parameteric cumulative risk estimator in the presence of a competing event.

Checking whether `causalRisk` agrees with a Kaplan-Meier estimator with a single source of right censoring and a 2-level treatment and a subgroup specified.

Checking whether `causalRisk` agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment w/ trimming at the first percentile.

Checking whether `causalRisk` agrees with an inverse-probability of treatment weighted Kaplan-Meier estimator with a 2-level treatment w/ trimming at the first percentile w/ in a subgroup

Checking `causalRisk` with longitudinal data with time-varying predictors of censoring

Checking whether `causalRisk` with data structured longitudinally agrees with equivalent data structured non-longitudinally

Checking whether `causalRisk` with data structured longitudinally agrees with equivalent data structured non-longitudinally when there are two sources or right censoring.