Theory and Design of asclepias
asclepias
is organized around three abstractions:
events, features and cohorts.
Event Data Theory
asclepias
provides the types and functions
for defining models of event data.
The terms "theory" and "model" are borrowed from the notion of a
Lawvere theory.
[1]
Definitions
Event
An Event
is a Context
(what happened) with an associated time interval (when it happened).
Concretely, an Event
is a wrapper around the interval-algebra
package’s
PairedInterval
type:
newtype Event t m a = MkEvent ( PairedInterval (Context t m) a )
Context
A Context
contains up to three types of information:
-
A tag set (required)
-
Facts about the event (required)
-
Metadata on the source of the event (optional)
A tag set is a set of labels that give meaning to the events of interest. For example, "diabetes diagnosis", "birth day", and "in hospital" are all possible tags that together might define the study's tag set.
The Context type is defined as follows.
data Context t m = MkContext
  { -- | the 'TagSet' of a @Context@
    getTagSet :: TagSet t      -- (1)
    -- | the facts of a @Context@.
  , getFacts  :: m             -- (2)
    -- | the 'Source' of @Context@
  , getSource :: Maybe Source  -- (3)
  }
(1) a set of tags, or labels, which can be used to identify events in a collection;
(2) facts about the event, whose shape and possible values are determined by the schema type m;
(3) (optionally) data about the provenance of the event in a Source object.
Facts
Facts are the data of interest for a particular event. The schema of the facts data is dynamic and is passed to the object as a parameter.
Event Model
Supplying specific types for the parameters t and m of Context creates a new event model.
An example of an event model is below.
data SillySchema =
    A Int
  | B Text
  | C
  | D
  deriving (Show, Eq, Generic, Data)

instance FromJSON SillySchema where
  parseJSON = genericParseJSON
    (defaultOptions
      { sumEncoding = TaggedObject { tagFieldName      = "domain"
                                   , contentsFieldName = "facts"
                                   }
      }
    )

type SillyEvent1 a = Event Text SillySchema a
The SillyEvent1 type is a synonym for an Event where the tag type is Text, the facts are of shape SillySchema, and the interval type is any valid type a.
The type parameter m
provides
a high degree of flexibility in defining new event models.
The m
type represents the schema, or shape,
of an event’s data and
can be a nearly arbitrary type
composed of sum and product types.
Often, the m type will be a sum type of "domains",
where each domain groups a set of related facts.
The schema of NoviSci’s standard
EDM
is organized around this idea.
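To illustrate the "sum type of domains" pattern, here is a minimal sketch that uses only base types. The names Domain, DiagnosisFacts, EnrollmentFacts, and describe are hypothetical. They are not taken from asclepias or from the EDM.

```haskell
-- Hypothetical schema: each constructor of 'Domain' is one domain,
-- carrying the facts relevant to that domain.
data DiagnosisFacts = MkDiagnosisFacts
  { code       :: String  -- a diagnosis code
  , codeSystem :: String  -- the coding system the code belongs to
  }
  deriving (Show, Eq)

newtype EnrollmentFacts = MkEnrollmentFacts
  { plan :: String }
  deriving (Show, Eq)

data Domain
  = Diagnosis DiagnosisFacts
  | Enrollment EnrollmentFacts
  | Death                       -- a domain with no additional facts
  deriving (Show, Eq)

-- Pattern matching on the domain gives access to its facts.
describe :: Domain -> String
describe (Diagnosis d)  = "diagnosis " ++ code d
describe (Enrollment e) = "enrolled in " ++ plan e
describe Death          = "death"
```

A type such as Domain would then fill the m parameter, for example as Event Text Domain a.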
Design of the Features module
A Feature
is a type parametrized by two types, name
and d
:
newtype (KnownSymbol name) => Feature name d =
  MkFeature ( FeatureData d )
The type d
here stands for "data",
which then parametrizes the FeatureData
type.
The FeatureData type is a wrapper around an Either:
newtype FeatureData d = MkFeatureData {
    getFeatureData :: Either MissingReason d -- ^ Unwrap FeatureData.
  }
The type d
can be almost anything
and need not be a scalar.
All the following are valid types for d
:
-
Int
-
Text
-
(Int, Maybe Text)
-
[Double]
The name
type is a bit special:
it does not appear on the right-hand side of the =
.
In type-theory parlance,
name
is a
phantom type.
So, a Feature
type constructor takes two arguments (name
and d
),
but its value constructor (MkFeature
)
takes a single value of type FeatureData d
.
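As a rough illustration of a phantom name, the sketch below uses a simplified stand-in rather than the library's own type. Feature', MkFeature', featureName, and age are hypothetical names, and String stands in for MissingReason. The symbolVal function from GHC.TypeLits recovers the type-level name as an ordinary String.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Proxy   (Proxy (..))
import GHC.TypeLits (KnownSymbol, Symbol, symbolVal)

-- Simplified stand-in for Feature: 'name' appears only on the
-- left-hand side, so it is a phantom type parameter.
newtype Feature' (name :: Symbol) d = MkFeature' (Either String d)
  deriving (Show, Eq)

-- Recover the type-level name as a runtime String.
featureName :: forall name d. KnownSymbol name => Feature' name d -> String
featureName _ = symbolVal (Proxy :: Proxy name)

-- The name lives only in the type; the value carries just the data.
age :: Feature' "age" Int
age = MkFeature' (Right 43)
```

Here featureName age evaluates to "age", even though the string never appears in the value itself.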
Values of the FeatureData
type contain
the data we’re ultimately interested in analyzing
or passing along to downstream applications.
However,
a FeatureData
value does not simply contain data of type d
.
The type allows for the possibility of
missingness, failures, or errors
via the
Either
type.
The content of a FeatureData d
, then, is either
a Left MissingReason
or a
Right d
.
The use of Either
has important implications when defining features,
as we will see.
Now that we know the internals of a Feature
,
how do we create them?
There are two ways to create features:
-
a pure lifting of data into a Feature, or
-
writing a Definition: a function that defines a Feature based on other Features.
The first method is a way to get data directly into a Feature
.
The following function takes a list of Events
and
makes a Feature
of them:
allEvents :: [Event Day] -> Feature "allEvents" [Event Day]
allEvents = pure
The pure
lifting is generally used to lift a subject’s input data into a Feature
,
so that other features can be defined from a subject’s data.
Features are derived from other Features by the Definition type.
Specifically, Definition is a type containing a function that maps Feature inputs to a Feature output.
The define (or defineA) function constructs the Definition.
For example:
myDef :: Definition (Feature "a" Int -> Feature "b" Bool)
myDef = define (\x -> if x > 0 then True else False)
Here, x has type Int, not Feature "a" Int, and the return type is Bool, not Feature "b" Bool.
The define
function and Definition
type
do the magic of lifting these types to the Feature
level.
To see this more clearly,
see myDef2
below:
intToBool :: Int -> Bool
intToBool x = if x > 0 then True else False

myDef2 :: Definition (Feature "a" Int -> Feature "b" Bool)
myDef2 = define intToBool
myDef2
is equivalent to myDef
.
The define
function, then,
lets us focus on the logic of our Feature
without needing to worry about handling the error cases.
If we were to write a function
with signature Feature "a" Int → Feature "b" Bool
directly,
it would look something like:
myFeat :: Feature "a" Int -> Feature "b" Bool
myFeat (MkFeature (MkFeatureData (Left r))) = MkFeature (MkFeatureData (Left r))
myFeat (MkFeature (MkFeatureData (Right x))) = MkFeature (MkFeatureData (Right $ intToBool x))
One would need to pattern match on every possible combination of Left and Right inputs, which becomes more complicated as the number of inputs increases.
As an aside,
since Features are Functors,
one could instead write:
myFeat :: Feature "a" Int -> Feature "b" Bool
myFeat = fmap intToBool
This would require understanding how Functors and similar structures are used.
The define
and defineA
functions provide a common interface
to these structures without needing to understand the details.
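For readers curious about those structures, the following sketch shows how a plain two-argument function can be lifted with an Applicative instance. It uses a simplified stand-in, not the library's own types: FD, MkFD, sumIfFlag, and sumIfFlagFD are hypothetical names, and String stands in for MissingReason.

```haskell
import Control.Applicative (liftA2)

-- Simplified stand-in for FeatureData: a wrapper around Either.
-- 'String' plays the role of MissingReason here.
newtype FD d = MkFD (Either String d)
  deriving (Show, Eq)

instance Functor FD where
  fmap f (MkFD x) = MkFD (fmap f x)

instance Applicative FD where
  pure = MkFD . Right
  MkFD f <*> MkFD x = MkFD (f <*> x)

-- A plain two-argument function on the underlying data ...
sumIfFlag :: Bool -> [Int] -> Int
sumIfFlag b ints = if b then sum ints else 0

-- ... lifted to operate on wrapped values. No pattern matching on
-- Left and Right is needed; the Applicative instance handles it.
sumIfFlagFD :: FD Bool -> FD [Int] -> FD Int
sumIfFlagFD = liftA2 sumIfFlag
```

With more inputs, the same pattern extends through the <$> and <*> operators. Whether define is implemented in exactly this way is not stated here; the sketch only shows the general idea.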
Evaluating Definitions
To evaluate a Definition
, we use the eval
function.
Consider the following example.
The input data is a list of Ints. If the list is empty (null), feat1 treats this as an error; otherwise it records whether the list has more than 3 elements.
In feat2, the sum is computed if the list has more than 3 elements; otherwise 0 is returned.
featInts :: [Int] -> Feature "someInts" [Int]
featInts = pure

feat1 :: Definition (Feature "someInts" [Int] -> Feature "hasMoreThan3" Bool)
feat1 = defineA
  (\ints -> if null ints
      then makeFeature (missingBecause $ Other "no data")
      else makeFeature $ featureDataR (length ints > 3))

feat2 :: Definition (
     Feature "hasMoreThan3" Bool
  -> Feature "someInts" [Int]
  -> Feature "sum" Int)
feat2 = define (\b ints -> if b then sum ints else 0)

ex0 = featInts []
ex0a = eval feat1 ex0 -- MkFeature (MkFeatureData (Left (Other "no data")))
ex0b = eval feat2 (ex0a, ex0) -- MkFeature (MkFeatureData (Left (Other "no data")))

ex1 = featInts [3, 8]
ex1a = eval feat1 ex1 -- MkFeature (MkFeatureData (Right False))
ex1b = eval feat2 (ex1a, ex1) -- MkFeature (MkFeatureData (Right 0))

ex2 = featInts [1..4]
ex2a = eval feat1 ex2 -- MkFeature (MkFeatureData (Right True))
ex2b = eval feat2 (ex2a, ex2) -- MkFeature (MkFeatureData (Right 10))
Note the value of ex0b
.
It is a Left
because the value of ex0a
is a Left
;
in other words, errors propagate from one Feature to the Features that depend on it.
If any dependency of a Feature is a Left, then that Feature will also be a Left.
A Feature's internal Either structure
means that downstream Features need not be computed
once one of their dependencies is a Left,
which can avoid unnecessary work.
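The short-circuiting behaviour can be illustrated with plain Either from base, independently of asclepias. In this sketch, String stands in for MissingReason, and the names Result, failed, neverNeeded, and combined are hypothetical.

```haskell
-- 'String' stands in for MissingReason in this sketch.
type Result = Either String

-- A dependency that has already failed.
failed :: Result Int
failed = Left "no data"

-- A second dependency that would crash if it were ever evaluated.
neverNeeded :: Result Int
neverNeeded = error "this value is never evaluated"

-- Either's Applicative instance returns the first Left without
-- inspecting the remaining arguments.
combined :: Result Int
combined = (+) <$> failed <*> neverNeeded
```

Evaluating combined yields the Left from failed, and neverNeeded is never forced.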
Type Safety of Features
In describing the Feature
type,
the utility of having the name as a type may not have been clear.
To clarify, consider the following example:
x :: Feature "someInt" Natural
x = pure 39
y :: Feature "age" Natural
y = pure 43
f :: Definition (Feature "age" Natural -> Feature "isOld" Bool)
f = define (>= 39)
fail = eval f x
pass = eval f y
In the example, fail
does not compile because "someInt"
is not "age"
,
even though both data types are Natural
.