NOTE: The terms “vector” and “variable” are (mostly) used interchangeably in this document.
`stype` provides an extensible set of data types that themselves extend certain R vector classes to be useful in a variety of analytic applications by providing vectors with:

- a `context` object with information about how the variable relates to a study design;
- a `data_summary` with relevant summary statistics of the variable;
- automatic updating of the `data_summary` when a vector is subset or modified in certain ways;
- preservation of the `context`.

The package relies heavily on the `vctrs` package, whose goals are:
- To propose `vec_size()` and `vec_type()` as alternatives to `length()` and `class()`; `vignette("type-size")`. These definitions are paired with a framework for type-coercion and size-recycling.
- To define type- and size-stability as desirable function properties, use them to analyse existing base functions, and propose better alternatives; `vignette("stability")`. This work has been particularly motivated by thinking about the ideal properties of `c()`, `ifelse()`, and `rbind()`.
- To provide a new `vctr` base class that makes it easy to create new S3 vectors; `vignette("s3-vector")`. `vctrs` provides methods for many base generics in terms of a few new `vctrs` generics, making implementation considerably simpler and more robust.

Each data type provided by `stype` (described in more detail below) has a constructor function named `v_<type>`. For example, `v_binary` creates binary (\(\{0, 1\}\)) data from R’s `logical` type.
library(stype)
x <- v_binary(c(TRUE, TRUE, TRUE, FALSE))
str(x)
#> bnry [1:4] 1, 1, 1, 0
#> @ internal_name : chr ""
#> @ data_summary :Formal class 'data_summary' [package "stype"] with 2 slots
#> .. ..@ .Data:List of 10
#> .. .. ..$ : int 4
#> .. .. ..$ : logi FALSE
#> .. .. ..$ : int 4
#> .. .. ..$ : int 0
#> .. .. ..$ : num 0
#> .. .. ..$ : logi FALSE
#> .. .. ..$ : int 1
#> .. .. ..$ : int 3
#> .. .. ..$ : num 0.75
#> .. .. ..$ : num 0.25
#> .. ..@ names: chr [1:10] "n" "has_missing" "n_nonmissing" "n_missing" ...
#> @ context :Formal class 'context' [package "stype"] with 6 slots
#> .. ..@ short_label : chr ""
#> .. ..@ long_label : chr ""
#> .. ..@ description : chr ""
#> .. ..@ derivation : chr ""
#> .. ..@ purpose :Formal class 'purpose' [package "stype"] with 2 slots
#> .. .. .. ..@ study_role: chr(0)
#> .. .. .. ..@ tags : chr(0)
#> .. ..@ security_type: chr ""
#> @ extra_descriptors : list()
#> @ auto_compute_summary: logi TRUE
#> @ stype_version :Classes 'package_version', 'numeric_version' hidden list of 1
#> ..$ : int [1:3] 0 5 1
The `v_binary` data type prints `0`s and `1`s, but the underlying data is `logical`:
x
#> <binary[4]>
#> [1] 1 1 1 0
#> Proportion = 0.750
vctrs::vec_data(x)
#> [1] TRUE TRUE TRUE FALSE
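The separation between displayed values and stored values can be illustrated with a minimal `vctrs` sketch. The class name `toy_binary` and its constructor are hypothetical and exist only for this illustration; this is not how `stype` itself is implemented, only one way a `vctrs`-based class could behave similarly.

```r
library(vctrs)

# Hypothetical toy class for illustration only; not stype's implementation.
new_toy_binary <- function(x = logical()) {
  stopifnot(is.logical(x))
  new_vctr(x, class = "toy_binary")
}

# format() decides how elements are displayed; the stored data stay logical.
format.toy_binary <- function(x, ...) {
  as.character(as.integer(vec_data(x)))
}

b <- new_toy_binary(c(TRUE, TRUE, TRUE, FALSE))

format(b)    # "1" "1" "1" "0"
vec_data(b)  # TRUE TRUE TRUE FALSE
```

Only the `format()` method changes here, so every operation on the underlying data still sees a logical vector.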
The `v_binary` type also includes some useful utilities, such as pretty-printing certain parts of the `data_summary` (here, the proportion) and a predicate function.
is_binary(x)
#> [1] TRUE
Certain math operations work and pull directly from the `data_summary` where appropriate (rather than recomputing).
Note that these operations are still under development and should be used with caution:
Other math/arithmetic operations don’t work:
# What do you mean you want to add binary and integer?
x + 2L
#> Error in `vec_arith()`:
#> ! <binary> + <integer> is not permitted
# R's base types are not so safe
vctrs::vec_data(x) + 2L
#> [1] 3 3 3 2
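The strictness shown above is consistent with how `vctrs` treats arithmetic in general: a class built with `new_vctr()` permits no arithmetic unless a `vec_arith()` method is defined for it. The sketch below uses a hypothetical class, `toy_flag`, to show that default; whether `stype` relies on exactly this mechanism is an assumption.

```r
library(vctrs)

# Hypothetical class used only for illustration.
flag <- new_vctr(c(TRUE, TRUE, FALSE), class = "toy_flag")

# No vec_arith() method exists for toy_flag, so vctrs refuses the operation.
result <- tryCatch(flag + 2L, error = function(e) e)
inherits(result, "error")  # TRUE

# The bare logical vector is silently coerced to integer instead.
vec_data(flag) + 2L        # 3 3 2
```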
Logical operators work as one might expect:
Here’s where the real magic is.
# vectors can be combined and ...
# subsetting maintains and updates attributes
c(x, !x[1:3])
#> <binary[7]>
#> [1] 1 1 1 0 0 0 0
#> Proportion = 0.429
# But ...
c(x, v_binary(context = context(purpose = purpose(study_role = "other"))))
#> Error: All purpose elements must be equal in order to combine stypes.
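A refusal to combine vectors whose metadata differ can be sketched with the `vctrs` coercion generics. The class `toy_role`, its `role` attribute, and the error message below are all hypothetical; the example follows the pattern of the `vctrs` S3 vignette and is not taken from the `stype` source.

```r
library(vctrs)

# Hypothetical class carrying a single "role" attribute.
new_toy_role <- function(x = logical(), role = "") {
  new_vctr(x, role = role, class = "toy_role")
}

# The common type of two toy_role vectors exists only when the roles agree.
vec_ptype2.toy_role.toy_role <- function(x, y, ...) {
  if (!identical(attr(x, "role"), attr(y, "role"))) {
    stop("Roles must match in order to combine.", call. = FALSE)
  }
  new_toy_role(role = attr(x, "role"))
}

# Casting to the common type keeps the data and takes the target's role.
vec_cast.toy_role.toy_role <- function(x, to, ...) {
  new_toy_role(vec_data(x), role = attr(to, "role"))
}

a <- new_toy_role(c(TRUE, FALSE), role = "outcome")
b <- new_toy_role(TRUE, role = "outcome")
d <- new_toy_role(TRUE, role = "covariate")

combined <- vec_c(a, b)  # same role: combines into one length-3 vector
mismatch <- tryCatch(vec_c(a, d), error = function(e) e)  # different roles: error
```

Putting the check in `vec_ptype2()` means combination fails before any data are copied, whichever function triggers it.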
The following table describes the proposed data types (not all of them may be available at this time). A – indicates that the type inherits properties from the level above.
`v_<type>` | prototype | support |
---|---|---|
`v_binary` | `logical` | \(\{0, 1\}\) |
`v_count` | `integer` | \(\{0, 1, 2, \dots\}\) |
`v_continuous` | `double` | \(\mathcal{R}\) |
`v_continuous_nonneg` | `double` | \(\mathcal{R}^+\) |
`v_nominal` | `factor` | |
`v_ordered` | `ordered` | |
`v_proportion` | `double` | \([0, 1]\) |
tibble
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
#>
#> Attaching package: 'tibble'
#> The following object is masked from 'package:stype':
#>
#> view
n <- 100
make_context <- function(role) {
  context(purpose = purpose(study_role = role))
}

covariates <-
  purrr::map(
    .x = purrr::set_names(1:10, paste0("x", 1:10)),
    .f = ~ v_binary(as.logical(rbinom(n, 1, 0.25)),
                    context = make_context("covariate"))
  )

dt <- tibble(
  y1 = v_binary(as.logical(rbinom(n, 1, 0.25)), context = make_context("outcome")),
  y2 = v_continuous_nonneg(runif(n, 1, 100), context = make_context("outcome")),
  y3 = v_continuous(rnorm(n), context = make_context("outcome")),
  !!! covariates
)
dt
#> # A tibble: 100 × 13
#> y1 y2 y3 x1 x2 x3 x4 x5 x6 x7 x8 x9
#> <bnry> <nneg> <cont> <bnry> <bnry> <bnry> <bnr> <bnr> <bnr> <bnr> <bnr> <bnr>
#> 1 1 20.8 0.838 0 0 1 1 0 0 0 0 0
#> 2 0 62.2 -0.291 0 0 0 1 0 0 1 0 0
#> 3 1 35.5 0.0536 0 1 0 0 0 1 0 0 0
#> 4 1 90.4 2 0 1 1 0 0 0 0 0 0
#> 5 0 18 1.6 0 0 1 0 1 0 0 1 1
#> 6 0 43.3 0.674 0 0 0 0 0 0 0 0 0
#> 7 0 28.9 -1.08 0 1 0 0 0 0 0 1 0
#> 8 0 17.7 0.147 0 0 0 0 1 0 0 1 0
#> 9 1 63.8 0.432 1 1 0 0 0 0 0 1 0
#> 10 0 77.1 -1.04 0 0 0 1 0 1 1 0 0
#> # … with 90 more rows, and 1 more variable: x10 <bnry>
Selecting columns based on data type:
dt %>% select_if(is_binary)
#> # A tibble: 100 × 11
#> y1 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
#> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry> <bnry>
#> 1 1 0 0 1 1 0 0 0 0 0 0
#> 2 0 0 0 0 1 0 0 1 0 0 0
#> 3 1 0 1 0 0 0 1 0 0 0 0
#> 4 1 0 1 1 0 0 0 0 0 0 1
#> 5 0 0 0 1 0 1 0 0 1 1 0
#> 6 0 0 0 0 0 0 0 0 0 0 1
#> 7 0 0 1 0 0 0 0 0 1 0 1
#> 8 0 0 0 0 0 1 0 0 1 0 0
#> 9 1 1 1 0 0 0 0 0 1 0 0
#> 10 0 0 0 0 1 0 1 1 0 0 1
#> # … with 90 more rows
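In recent versions of `dplyr` (1.0.0 and later), `select_if()` is superseded by `select()` combined with `where()`. The sketch below shows that idiom on a small plain tibble, with `is.logical` standing in for a predicate such as `is_binary`. The toy data are invented for illustration, and the assumption is that a `stype` predicate would work the same way inside `where()`.

```r
library(dplyr)

# Toy data with one column per base type; not stype vectors.
toy <- tibble(
  flag  = c(TRUE, FALSE),
  count = 1:2,
  label = c("a", "b")
)

# where() applies the predicate to each column and keeps those returning TRUE.
logical_cols <- toy %>% select(where(is.logical))

names(logical_cols)  # "flag"
```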
Selecting columns based on context: