stype (pronounced stipe) is an R package for statistical data types. It depends heavily on the vctrs package.
The stype package provides classes that enforce (run-time) safety for types common to many statistical analyses, such as v_binary, v_continuous, v_count, and v_nominal. For example, binary data can be represented in R in at least three ways: a logical, a factor with two levels, or a numeric using just 0 and 1. Which representation should one use? The latter two do not guarantee that certain binary operations are closed in a mathematical sense; e.g., c(0, 1, 0, 1) + 1:4 returns c(1, 3, 3, 5), which is no longer binary. Such behavior is not possible with v_binary. Similarly, count data can be represented by an integer in R, but an integer is not restricted to non-negative values. The v_count constructor enforces non-negativity.
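
The closure problem described above can be reproduced in base R without stype installed. The following is a minimal sketch: check_count is a hypothetical helper written for illustration, not a function from stype, and it only approximates the kind of run-time check a constructor such as v_count might perform.

```r
# Three base-R representations of the same binary variable.
as_logical <- c(FALSE, TRUE, FALSE, TRUE)
as_factor  <- factor(c("no", "yes", "no", "yes"), levels = c("no", "yes"))
as_numeric <- c(0, 1, 0, 1)

# Arithmetic on the numeric form silently leaves the set {0, 1}.
shifted <- as_numeric + 1:4
print(shifted)
print(all(shifted %in% c(0, 1)))

# Hypothetical helper (NOT part of stype): accept only non-negative
# whole numbers, letting missing values pass through unchanged.
check_count <- function(x) {
  stopifnot(is.numeric(x))
  ok <- is.na(x) | (x >= 0 & x == trunc(x))
  if (!all(ok)) {
    stop("counts must be non-negative whole numbers", call. = FALSE)
  }
  as.integer(x)
}

valid_counts <- check_count(c(0, 3, 12))

# A negative value is rejected with an error instead of being stored.
negative_rejected <- inherits(
  try(check_count(c(2L, -1L)), silent = TRUE),
  "try-error"
)
```

Checking at construction time means an invalid value fails where it is created rather than surfacing later in an analysis.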
Each stype object carries two attributes that users may find useful: context and data_summary. A context can be used to specify project-specific metadata. It is an S4 object containing slots such as short_label, long_label, description, security_type, tags, and purpose. A purpose, for example, can be used to define a variable’s role in a study design, such as “outcome”, “identifier”, “covariate”, or “exposure”. This kind of contextual information is invaluable in data pipelines.
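
To make the idea of an S4 metadata object concrete, here is a rough sketch using only the methods package. The class name demo_context, the choice of character for every slot, and the example labels are assumptions made for illustration; the actual class, slot types, and constructor in stype may differ.

```r
library(methods)

# Hypothetical class (NOT stype's own): the slot names follow the text
# above, but the slot types are assumed.
setClass(
  "demo_context",
  representation(
    short_label   = "character",
    long_label    = "character",
    description   = "character",
    security_type = "character",
    tags          = "character",
    purpose       = "character"
  )
)

# Slots that are not supplied default to character(0).
ctx <- new(
  "demo_context",
  short_label = "died",
  long_label  = "Death within 30 days",
  purpose     = "outcome"
)

# Attach the metadata to a plain vector as an attribute and read it back.
died <- c(TRUE, FALSE, TRUE)
attr(died, "context") <- ctx
print(attr(died, "context")@purpose)
```

Keeping the metadata on the vector itself lets downstream code select variables by role, for example every variable whose purpose is "outcome".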
A stype vector also contains a data_summary object, which is generated automatically and contains summary statistics about the data. All objects contain the following statistics:
- n: the number of observations
- has_missing: an indicator of whether the variable has missing data
- n_nonmissing: the number of nonmissing values
- n_missing: the number of missing values
- proportion_missing: the proportion of values that are missing
- is_constant: an indicator of whether all the values are the same

Each type has additional summary statistics relevant to its data. For example, v_continuous contains the mean, standard deviation, min, max, and various quantiles. The data_summary is updated whenever a variable is subset or two vectors of the same type are combined.
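
The common statistics listed above are simple to compute by hand, which may help clarify what each one means. The sketch below uses a hypothetical helper, summarise_common, that is not part of stype. It assumes that is_constant ignores missing values; the source does not say how stype handles that case.

```r
# Hypothetical helper (NOT part of stype): compute the statistics that
# every data_summary is described as containing.
summarise_common <- function(x) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  list(
    n = n,
    has_missing = n_missing > 0,
    n_nonmissing = n - n_missing,
    n_missing = n_missing,
    proportion_missing = if (n > 0) n_missing / n else NA_real_,
    # Assumption: missing values are ignored when checking constancy.
    is_constant = length(unique(x[!is.na(x)])) <= 1
  )
}

x <- c(2.5, NA, 4.0, 4.0)
full <- summarise_common(x)

# Subsetting changes the data, so the summary has to be recomputed.
sub <- summarise_common(x[3:4])

str(full)
str(sub)
```

Recomputing on the subset mirrors the behavior described above, where the data_summary is refreshed after subsetting or combining.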
The package also prints certain attributes, for example:
> stype::v_binary(c(TRUE, FALSE, TRUE))
<binary[3]>
[1] 1 0 1
Proportion = 0.667
> stype::v_binary(c(TRUE, FALSE, TRUE, NA))
<binary[4]>
[1] 1 0 1 NA
Proportion = 0.667; Missing = 1.000