NOTE: The terms “vector” and “variable” are (mostly) used interchangeably in this document.

# Design Goals

stype provides an extensible set of data types that extend certain R vector classes so that they are useful in a variety of analytic applications. These types provide vectors with:

• data types that align with common statistical data types (e.g. “binary”, “continuous”, “count”, “time to event”, etc.);
• a context object with information about how the variable relates to a study design;
• a data_summary with relevant summary statistics of the variable;
• methods that update the data_summary when a vector is subset or modified in certain ways;
• various useful utilities such as predicate functions for identifying variables based on the data type or context.
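
The idea of a summary that follows the vector through subsetting can be sketched in base R. This is only an illustration under stated assumptions: the `toy_binary` class, its constructor, and the `strip_attributes()` helper are hypothetical names, not part of stype, and stype itself builds on vctrs rather than on hand-written S3 methods like these.

```r
# Hypothetical toy class (not part of stype): a logical vector that carries
# a summary attribute describing its current values.
new_toy_binary <- function(x = logical()) {
  stopifnot(is.logical(x))
  structure(
    x,
    class = "toy_binary",
    data_summary = list(n = length(x), proportion = mean(x, na.rm = TRUE))
  )
}

# Helper: return the underlying logical data without class or attributes.
strip_attributes <- function(x) {
  attributes(x) <- NULL
  x
}

# Subsetting takes the raw data, subsets it, and rebuilds the object, so the
# summary is recomputed for the elements that remain.
`[.toy_binary` <- function(x, i) {
  new_toy_binary(strip_attributes(x)[i])
}

x <- new_toy_binary(c(TRUE, TRUE, TRUE, FALSE))
y <- x[1:2]

attr(x, "data_summary")$proportion  # summary of all four values
attr(y, "data_summary")$proportion  # summary of the first two values only
```

Rebuilding the object through the constructor keeps a single place where the summary is computed, which is the design choice this sketch is meant to show.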

## Implementation

The package relies heavily on the vctrs package, whose goals are:

• To propose vec_size() and vec_type() as alternatives to length() and class(); vignette("type-size"). These definitions are paired with a framework for type-coercion and size-recycling.
• To define type- and size-stability as desirable function properties, use them to analyse existing base functions, and to propose better alternatives; vignette("stability"). This work has been particularly motivated by thinking about the ideal properties of c(), ifelse(), and rbind().
• To provide a new vctr base class that makes it easy to create new S3 vectors; vignette("s3-vector"). vctrs provides methods for many base generics in terms of a few new vctrs generics, making implementation considerably simpler and more robust.
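
The base R behaviours that motivate these goals are easy to reproduce. The snippet below is a small illustration rather than anything taken from vctrs or stype; the final call is guarded because it assumes the vctrs package is installed.

```r
# length() of a data frame counts columns, not observations.
df <- data.frame(a = 1:3, b = c("u", "v", "w"))
n_cols <- length(df)
n_rows <- nrow(df)

# c() silently coerces mixed inputs to a common type (here, character).
combined <- c(1L, "a")
class(combined)

# ifelse() drops attributes, so a Date comes back as a bare number.
d <- as.Date("2020-01-01")
picked <- ifelse(TRUE, d, d)
class(picked)

# vctrs::vec_size() treats a data frame as a vector of rows.
if (requireNamespace("vctrs", quietly = TRUE)) {
  vctrs::vec_size(df)
}
```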

## A quick example

Each data type provided by stype (described in more detail below) has a constructor function named v_<type>. For example, v_binary creates binary ($$\{0, 1\}$$) data from R’s logical type.

library(stype)

x <- v_binary(c(TRUE, TRUE, TRUE, FALSE))

str(x)
#>  bnry [1:4] 1, 1, 1, 0
#>  @ internal_name    : chr ""
#>  @ data_summary     :Formal class 'data_summary' [package "stype"] with 2 slots
#>   .. ..@ .Data:List of 10
#>   .. .. ..$ : int 4
#>   .. .. ..$ : logi FALSE
#>   .. .. ..$ : int 4
#>   .. .. ..$ : int 0
#>   .. .. ..$ : num 0
#>   .. .. ..$ : logi FALSE
#>   .. .. ..$ : int 1
#>   .. .. ..$ : int 3
#>   .. .. ..$ : num 0.75
#>   .. .. ..$ : num 0.25
#>   .. ..@ names: chr [1:10] "n" "has_missing" "n_nonmissing" "n_missing" ...
#>  @ context          :Formal class 'context' [package "stype"] with 6 slots
#>   .. ..@ short_label  : chr ""
#>   .. ..@ long_label   : chr ""
#>   .. ..@ description  : chr ""
#>   .. ..@ derivation   : chr ""
#>   .. ..@ purpose      :Formal class 'purpose' [package "stype"] with 2 slots
#>   .. .. .. ..@ study_role: chr ""
#>   .. .. .. ..@ tags      : chr ""
#>   .. ..@ security_type: chr ""
#>  @ extra_descriptors: list()

The v_binary data type prints 0s and 1s but the underlying data is logical:

x
#> <binary[4]>
#> [1] 1 1 1 0
#> Proportion = 0.750
vctrs::vec_data(x)
#> [1]  TRUE  TRUE  TRUE FALSE

The data type includes some useful utilities, such as pretty-printing parts of the data_summary (here, the proportion) and a predicate function.

is_binary(x)
#> [1] TRUE

Certain math operations work and, where appropriate, read their result directly from the data_summary rather than recomputing it. These operations are still under development and should be used with caution:

mean(x)
#> [1] 0.75
sum(x)
#> [1] 3

# sum(x, x) # See? very experimental

Other math/arithmetic operations don’t work:

# What do you mean you want to add binary and integer?
x + 2L
#> Error: Can't convert from <integer> to <logical> due to loss of precision.
#> * Locations: 1

# R's base types are not so safe
vctrs::vec_data(x) + 2L
#> [1] 3 3 3 2
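
One way to get this kind of safety in plain S3 is an Ops group method that allows logical operators and rejects everything else. The sketch below assumes a hypothetical `toy_flag` class and is not how stype implements the check (the error above comes from vctrs coercion rules); it only shows the general technique.

```r
# Hypothetical toy class (not part of stype).
toy_flag <- function(x) structure(x, class = "toy_flag")

# Ops is the S3 group generic for operators. Allow `!`, `&` and `|`;
# refuse arithmetic and comparison operators with an error.
Ops.toy_flag <- function(e1, e2) {
  if (!.Generic %in% c("!", "&", "|")) {
    stop("`", .Generic, "` is not defined for toy_flag vectors", call. = FALSE)
  }
  a <- unclass(e1)
  if (missing(e2)) {
    return(toy_flag(!a))  # unary `!` is the only one-argument case allowed
  }
  toy_flag(get(.Generic)(a, unclass(e2)))
}

f <- toy_flag(c(TRUE, FALSE))
negated <- !f
attempt <- tryCatch(f + 2L, error = function(e) conditionMessage(e))
```

Here `negated` is still a `toy_flag`, while `attempt` holds the error message instead of a silently coerced integer vector.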

Logical operators work as one might expect:

!x
#> <binary[4]>
#> [1] 0 0 0 1
#> Proportion = 0.250
all(x)
#> [1] FALSE
any(x)
#> [1] TRUE

Here’s where the real magic is.

# vectors can be combined and ...
# subsetting maintains and updates attributes
c(x, !x[1:3])
#> <binary[7]>
#> [1] 1 1 1 0 0 0 0
#> Proportion = 0.429

# But ...
c(x, v_binary(context = context(purpose = purpose(study_role = "other"))))
#> Error: All purpose elements must be equal in order to combine stypes.
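
The rule that vectors only combine when their context agrees can be sketched with a `c()` method. The `toy_var` class, its `role` attribute, and the error text are assumptions made for illustration; stype compares full purpose objects and does so through vctrs, not through a method like this one.

```r
# Hypothetical toy class (not part of stype): a vector tagged with a study role.
toy_var <- function(x, role = "") structure(x, class = "toy_var", role = role)

# Helper: return the underlying data without class or attributes.
strip_attributes <- function(x) {
  attributes(x) <- NULL
  x
}

# Combine only when every input carries the same role; otherwise fail loudly.
# This sketch assumes all inputs are toy_var objects.
c.toy_var <- function(...) {
  parts <- list(...)
  roles <- unique(vapply(parts, function(p) attr(p, "role"), character(1)))
  if (length(roles) != 1L) {
    stop("All roles must be equal in order to combine.", call. = FALSE)
  }
  toy_var(unlist(lapply(parts, strip_attributes)), role = roles)
}

a <- toy_var(c(TRUE, FALSE), role = "outcome")
b <- toy_var(TRUE, role = "outcome")
ab <- c(a, b)  # same role: combined, and the role is kept

clash <- tryCatch(
  c(a, toy_var(TRUE, role = "other")),
  error = function(e) conditionMessage(e)
)
```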

# Data types

The following table describes the proposed data types (not all of these may be available at this time). A – indicates that the type inherits properties from the level above.

| v_<type> | prototype | support |
|---|---|---|
| v_binary | logical | $$\{0, 1\}$$ |
| v_count | integer | $$\{0, 1, 2, \dots\}$$ |
| v_continuous | double | $$\mathcal{R}$$ |
| v_continuous_nonneg | double | $$\mathcal{R}^+$$ |
| v_event_time | double | $$\mathcal{R}^+$$ |
| v_nominal | factor | |
| v_ordered | ordered | |
| v_date | Date | |
| v_character | character | |
# Usage in tibble

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(tibble)
#>
#> Attaching package: 'tibble'
#> The following object is masked from 'package:stype':
#>
#>     view
n <- 100

make_context <- function(role) {
  context(purpose = purpose(study_role = role))
}

covariates <-
  purrr::map(
    .x = purrr::set_names(1:10, paste0("x", 1:10)),
    .f = ~ v_binary(as.logical(rbinom(n, 1, 0.25)),
                    context = make_context("covariate"))
  )

dt <- tibble(
  y1 = v_binary(as.logical(rbinom(n, 1, 0.25)), context = make_context("outcome")),
  y2 = v_event_time(runif(n, 1, 100), context = make_context("outcome")),
  y3 = v_continuous(rnorm(n), context = make_context("outcome")),
  !!! covariates
)

dt
#> # A tibble: 100 x 13
#>       y1    y2      y3    x1    x2    x3    x4    x5    x6    x7    x8    x9
#>    <bnr> <tme>  <cont> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr>
#>  1     0  26.9   0.478     1     0     1     0     0     0     1     1     0
#>  2     0    95    1.62     0     1     0     0     0     0     1     0     0
#>  3     0  55.1   0.175     0     0     1     1     0     1     0     1     1
#>  4     0    18  -0.436     0     0     1     1     0     0     1     1     1
#>  5     0  86.7    3.08     0     1     1     0     1     1     0     0     0
#>  6     1  11.3   0.312     0     0     0     0     0     1     0     0     1
#>  7     0  67.4   0.894     0     1     0     0     0     0     0     1     1
#>  8     0  30.2    1.53     0     0     1     0     0     0     0     0     1
#>  9     1  88.7 -0.0933     1     0     1     0     0     0     1     0     0
#> 10     0  72.5 -0.0954     0     0     0     0     0     1     1     0     1
#> # … with 90 more rows, and 1 more variable: x10 <bnry>

Selecting columns based on data type:

dt %>% select_if(is_binary)
#> # A tibble: 100 x 11
#>        y1     x1     x2     x3     x4     x5     x6     x7     x8     x9    x10
#>    <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry>
#>  1      0      1      0      1      0      0      0      1      1      0      1
#>  2      0      0      1      0      0      0      0      1      0      0      0
#>  3      0      0      0      1      1      0      1      0      1      1      0
#>  4      0      0      0      1      1      0      0      1      1      1      0
#>  5      0      0      1      1      0      1      1      0      0      0      0
#>  6      1      0      0      0      0      0      1      0      0      1      0
#>  7      0      0      1      0      0      0      0      0      1      1      0
#>  8      0      0      0      1      0      0      0      0      0      1      0
#>  9      1      1      0      1      0      0      0      1      0      0      1
#> 10      0      0      0      0      0      0      1      1      0      1      1
#> # … with 90 more rows

Selecting columns based on context:

dt %>% select_if(is_outcome)
#> # A tibble: 100 x 3
#>        y1     y2      y3
#>    <bnry> <tmev>  <cont>
#>  1      0   26.9   0.478
#>  2      0     95    1.62
#>  3      0   55.1   0.175
#>  4      0     18  -0.436
#>  5      0   86.7    3.08
#>  6      1   11.3   0.312
#>  7      0   67.4   0.894
#>  8      0   30.2    1.53
#>  9      1   88.7 -0.0933
#> 10      0   72.5 -0.0954
#> # … with 90 more rows
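
Predicate-based selection of this kind needs only a function that inspects a column and returns TRUE or FALSE. The base R sketch below shows the mechanism with a hypothetical `role` attribute and hypothetical helpers `with_role()` and `has_role()`; it is not how is_outcome is implemented in stype.

```r
# Hypothetical helpers (not part of stype): tag a column with a role and
# build a predicate that tests for a given role.
with_role <- function(x, role) structure(x, role = role)
has_role <- function(role) function(x) identical(attr(x, "role"), role)

df <- data.frame(id = 1:3)
df$y <- with_role(c(0.1, 0.2, 0.3), "outcome")
df$x1 <- with_role(c(TRUE, FALSE, TRUE), "covariate")

# Apply the predicate to every column and keep the columns where it is TRUE.
keep <- vapply(df, has_role("outcome"), logical(1))
outcomes <- df[keep]
names(outcomes)
```

In dplyr 1.0.0 and later, select_if() is superseded by select(where(...)), so the same predicates can also be passed as `dt %>% select(where(is_outcome))`.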