NOTE: The terms “vector” and “variable” are (mostly) used interchangeably in this document.

Design Goals

stype provides an extensible set of data types that extend certain R vector classes for use in a variety of analytic applications. These vectors come with:

  • data types that align with common statistical data types (e.g. “binary”, “continuous”, “count”, “time to event”, etc.);
  • a context object with information about how the variable relates to a study design;
  • a data_summary with relevant summary statistics of the variable;
  • methods that update the data_summary when a vector is subset or modified in certain ways;
  • various useful utilities such as predicate functions for identifying variables based on the data type or context.

Implementation

The package relies heavily on the vctrs package whose goals are:

  • To propose vec_size() and vec_type() as alternatives to length() and class(); vignette("type-size"). These definitions are paired with a framework for type-coercion and size-recycling.
  • To define type- and size-stability as desirable function properties, use them to analyse existing base functions, and to propose better alternatives; vignette("stability"). This work has been particularly motivated by thinking about the ideal properties of c(), ifelse(), and rbind().
  • To provide a new vctr base class that makes it easy to create new S3 vectors; vignette("s3-vector"). vctrs provides methods for many base generics in terms of a few new vctrs generics, making implementation considerably simpler and more robust.
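
The size and coercion ideas in these goals can be seen with a few calls. This is a minimal sketch using only base R and vctrs; the data frame df is made up for illustration.

```r
library(vctrs)

# A made-up data frame: 3 rows, 2 columns.
df <- data.frame(a = 1:3, b = c("u", "v", "w"))

length(df)    # counts columns
vec_size(df)  # counts rows (observations)

# Base c() silently coerces a number and a string to character ...
class(c(1, "a"))

# ... whereas vec_c() refuses, because double and character
# have no common type.
res <- try(vec_c(1, "a"), silent = TRUE)
inherits(res, "try-error")

# Compatible types are still combined: integer and double give double.
typeof(vec_c(1L, 2.5))
```

The stricter coercion rules are what let a package built on vctrs reject combinations that base R would quietly accept.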

A quick example

Each data type provided by stype (described in more detail below) has a constructor function named v_<type>. For example, v_binary creates binary (\(\{0, 1\}\)) data from R’s logical type.

library(stype)

x <- v_binary(c(TRUE, TRUE, TRUE, FALSE))

str(x)
#>  bnry [1:4] 1, 1, 1, 0
#>  @ internal_name    : chr ""
#>  @ data_summary     :Formal class 'data_summary' [package "stype"] with 2 slots
#>   .. ..@ .Data:List of 10
#>   .. .. ..$ : int 4
#>   .. .. ..$ : logi FALSE
#>   .. .. ..$ : int 4
#>   .. .. ..$ : int 0
#>   .. .. ..$ : num 0
#>   .. .. ..$ : logi FALSE
#>   .. .. ..$ : int 1
#>   .. .. ..$ : int 3
#>   .. .. ..$ : num 0.75
#>   .. .. ..$ : num 0.25
#>   .. ..@ names: chr [1:10] "n" "has_missing" "n_nonmissing" "n_missing" ...
#>  @ context          :Formal class 'context' [package "stype"] with 6 slots
#>   .. ..@ short_label  : chr ""
#>   .. ..@ long_label   : chr ""
#>   .. ..@ description  : chr ""
#>   .. ..@ derivation   : chr ""
#>   .. ..@ purpose      :Formal class 'purpose' [package "stype"] with 2 slots
#>   .. .. .. ..@ study_role: chr ""
#>   .. .. .. ..@ tags      : chr ""
#>   .. ..@ security_type: chr ""
#>  @ extra_descriptors: list()

The v_binary data type prints 0s and 1s but the underlying data is logical:

x
#> <binary[4]>
#> [1] 1 1 1 0
#> Proportion = 0.750
vctrs::vec_data(x)
#> [1]  TRUE  TRUE  TRUE FALSE
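
How a vctrs-based class can display one thing while storing another can be sketched with a toy class. The names new_toy_binary and toy_binary are hypothetical and this is not stype's implementation; it only assumes vctrs::new_vctr() and an S3 format() method.

```r
library(vctrs)

# Hypothetical toy class for illustration; not stype's implementation.
new_toy_binary <- function(x = logical()) {
  stopifnot(is.logical(x))
  new_vctr(x, class = "toy_binary")
}

# Show TRUE/FALSE as 1/0 when formatting; the stored data is untouched.
format.toy_binary <- function(x, ...) {
  as.character(as.integer(vec_data(x)))
}

y <- new_toy_binary(c(TRUE, TRUE, TRUE, FALSE))
format(y)    # character vector of "1" and "0"
vec_data(y)  # still the underlying logical vector
```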

The data type includes some useful utilities, such as pretty-printing certain parts of the data_summary (here, the proportion) and a predicate function.

is_binary(x)
#> [1] TRUE

Certain math operations work and, where appropriate, pull their result directly from the data_summary rather than recomputing it. Note that these operations are still under development and should be used with caution:

mean(x)
#> [1] 0.75
sum(x)
#> [1] 3

# sum(x, x) # See? very experimental
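
The idea of reading a stored summary instead of recomputing it can be sketched in base R. The class cached_binary and its proportion attribute are hypothetical, chosen only for illustration; stype's own data_summary is richer and is maintained differently.

```r
# Hypothetical toy class for illustration; not stype's implementation.
new_cached <- function(x) {
  stopifnot(is.logical(x))
  # Compute the summary once, at construction time.
  structure(x, proportion = mean(x), class = "cached_binary")
}

# mean() reads the stored value instead of recomputing it.
mean.cached_binary <- function(x, ...) attr(x, "proportion")

# Subsetting rebuilds the object so the stored summary stays in sync.
`[.cached_binary` <- function(x, i) new_cached(unclass(x)[i])

z <- new_cached(c(TRUE, TRUE, TRUE, FALSE))
mean(z)       # 0.75, read from the attribute
mean(z[3:4])  # 0.5, recomputed when the subset was built
```

The trade-off is that every operation that changes the data must also refresh the summary, which is why only some operations are supported.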

Other math/arithmetic operations don’t work:

# What do you mean you want to add binary and integer?
x + 2L
#> Error: Can't convert from <integer> to <logical> due to loss of precision.
#> * Locations: 1

# R's base types are not so safe
vctrs::vec_data(x) + 2L
#> [1] 3 3 3 2
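
The same kind of safety comes by default with any class built on vctrs::new_vctr(): arithmetic is an error until the class author defines it. A minimal sketch, using a hypothetical toy_binary class rather than stype itself:

```r
library(vctrs)

# Hypothetical toy class; no arithmetic methods are defined for it.
toy <- new_vctr(c(TRUE, TRUE, FALSE), class = "toy_binary")

# Arithmetic on the classed vector fails rather than silently coercing.
res <- try(toy + 2L, silent = TRUE)
inherits(res, "try-error")

# Stripping the class returns to base R behaviour: logical + integer.
vec_data(toy) + 2L
```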

Logical operators work as one might expect:

!x
#> <binary[4]>
#> [1] 0 0 0 1
#> Proportion = 0.250
all(x)
#> [1] FALSE
any(x)
#> [1] TRUE

Here’s where the real magic is.

# vectors can be combined and ...
# subsetting maintains and updates attributes
c(x, !x[1:3])
#> <binary[7]>
#> [1] 1 1 1 0 0 0 0
#> Proportion = 0.429

# But ...
c(x, v_binary(context = context(purpose = purpose(study_role = "other"))))
#> Error: All purpose elements must be equal in order to combine stypes.
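
A check of this kind can be sketched in base R with a c() method that compares an attribute before combining. The class tagged, its role attribute, and the error text are hypothetical stand-ins for stype's context and purpose objects, and the sketch assumes every argument is a tagged vector.

```r
# Hypothetical toy class for illustration; not stype's implementation.
new_tagged <- function(x, role = "") {
  structure(x, role = role, class = "tagged")
}

# Combine only when every piece carries the same role.
c.tagged <- function(...) {
  parts <- list(...)
  roles <- unique(vapply(parts, function(p) attr(p, "role"), character(1)))
  if (length(roles) != 1L) {
    stop("All roles must be equal in order to combine.")
  }
  new_tagged(unlist(lapply(parts, unclass)), role = roles)
}

a <- new_tagged(c(TRUE, FALSE), role = "outcome")
b <- new_tagged(TRUE, role = "outcome")
d <- new_tagged(TRUE, role = "covariate")

ab  <- c(a, b)                      # same role: allowed
bad <- try(c(a, d), silent = TRUE)  # different roles: error
```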

Data types

The following table describes the proposed data types (not all of these may be available at this time). A – indicates that the type inherits properties from the level above.

v_<type>             prototype  support
v_binary             logical    \(\{0, 1\}\)
v_count              integer    \(\{0, 1, 2, \dots\}\)
v_continuous         double     \(\mathcal{R}\)
v_continuous_nonneg  double     \(\mathcal{R}^+\)
v_event_time         double     \(\mathcal{R}^+\)
v_nominal            factor
v_ordered            ordered
v_date               Date
v_character          character
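
To make the prototype and support columns concrete, the following sketch writes a few of the rows as base R predicates. The in_support list is hypothetical and not part of stype, and it assumes that the nonnegative support includes zero.

```r
# Hypothetical helpers for illustration; not part of stype.
in_support <- list(
  # logical prototype, values in {0, 1}
  binary = function(x) is.logical(x),
  # integer prototype, values in {0, 1, 2, ...}
  count = function(x) is.integer(x) && all(x >= 0L, na.rm = TRUE),
  # double prototype, nonnegative values (zero assumed to be allowed)
  continuous_nonneg = function(x) is.double(x) && all(x >= 0, na.rm = TRUE)
)

in_support$count(c(0L, 3L, 10L))          # TRUE
in_support$count(c(-1L, 2L))              # FALSE: negative value
in_support$count(c(1, 2))                 # FALSE: double, not integer
in_support$continuous_nonneg(c(0, 2.5))   # TRUE
```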

Usage in tibble

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
#> 
#> Attaching package: 'tibble'
#> The following object is masked from 'package:stype':
#> 
#>     view
n <- 100

make_context <- function(role){
  context(purpose = purpose(study_role = role))
}

covariates <- purrr::map(
  .x = purrr::set_names(1:10, paste0("x", 1:10)),
  .f = ~ v_binary(as.logical(rbinom(n, 1, 0.25)),
                  context = make_context("covariate"))
)

dt <- tibble(
  y1 = v_binary(as.logical(rbinom(n, 1, 0.25)), context = make_context("outcome")),
  y2 = v_event_time(runif(n, 1, 100), context = make_context("outcome")),
  y3 = v_continuous(rnorm(n), context = make_context("outcome")),
  !!! covariates
)

dt
#> # A tibble: 100 x 13
#>       y1    y2      y3    x1    x2    x3    x4    x5    x6    x7    x8    x9
#>    <bnr> <tme>  <cont> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr>
#>  1     0  26.9   0.478     1     0     1     0     0     0     1     1     0
#>  2     0    95    1.62     0     1     0     0     0     0     1     0     0
#>  3     0  55.1   0.175     0     0     1     1     0     1     0     1     1
#>  4     0    18  -0.436     0     0     1     1     0     0     1     1     1
#>  5     0  86.7    3.08     0     1     1     0     1     1     0     0     0
#>  6     1  11.3   0.312     0     0     0     0     0     1     0     0     1
#>  7     0  67.4   0.894     0     1     0     0     0     0     0     1     1
#>  8     0  30.2    1.53     0     0     1     0     0     0     0     0     1
#>  9     1  88.7 -0.0933     1     0     1     0     0     0     1     0     0
#> 10     0  72.5 -0.0954     0     0     0     0     0     1     1     0     1
#> # … with 90 more rows, and 1 more variable: x10 <bnry>

Selecting columns based on data type:

dt %>% select_if(is_binary)
#> # A tibble: 100 x 11
#>        y1     x1     x2     x3     x4     x5     x6     x7     x8     x9    x10
#>    <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry>
#>  1      0      1      0      1      0      0      0      1      1      0      1
#>  2      0      0      1      0      0      0      0      1      0      0      0
#>  3      0      0      0      1      1      0      1      0      1      1      0
#>  4      0      0      0      1      1      0      0      1      1      1      0
#>  5      0      0      1      1      0      1      1      0      0      0      0
#>  6      1      0      0      0      0      0      1      0      0      1      0
#>  7      0      0      1      0      0      0      0      0      1      1      0
#>  8      0      0      0      1      0      0      0      0      0      1      0
#>  9      1      1      0      1      0      0      0      1      0      0      1
#> 10      0      0      0      0      0      0      1      1      0      1      1
#> # … with 90 more rows

Selecting columns based on context:

dt %>% select_if(is_outcome)
#> # A tibble: 100 x 3
#>        y1     y2      y3
#>    <bnry> <tmev>  <cont>
#>  1      0   26.9   0.478
#>  2      0     95    1.62
#>  3      0   55.1   0.175
#>  4      0     18  -0.436
#>  5      0   86.7    3.08
#>  6      1   11.3   0.312
#>  7      0   67.4   0.894
#>  8      0   30.2    1.53
#>  9      1   88.7 -0.0933
#> 10      0   72.5 -0.0954
#> # … with 90 more rows
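
Since dplyr 1.0.0, select_if() is superseded by select() combined with where(), which takes the same kind of predicate function. The sketch below uses the built-in iris data and base R predicates so that it runs without stype; that stype predicates such as is_binary can be passed to where() in the same way is an assumption, not something tested here.

```r
library(dplyr)

# Keep only the columns for which the predicate returns TRUE.
numeric_cols <- iris %>% select(where(is.numeric))
factor_cols  <- iris %>% select(where(is.factor))

names(numeric_cols)
names(factor_cols)
```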