Theory and Design of asclepias
asclepias is organized around three abstractions:
events, features and cohorts.
Event Data Theory
asclepias provides the types and functions for defining models of event
data. The types and terminology are intended to be consistent with the
event-data-model project.
Definitions
Event
An Event is a Context with an associated time interval.
Concretely, an Event is a wrapper around the interval-algebra package’s
PairedInterval type:
newtype Event t m a= MkEvent ( PairedInterval (Context t m) a )
Context
A Context contains up to three types of information:
-
A
TagSet(required) -
Facts about the event (required) -
Metadata on the source of the event (optional)
An example of a context is below.
data Context t m = MkContext
{ -- | the 'TagSet' of a @Context@
getTagSet :: TagSet t (1)
-- | the facts of a @Context@.
, getFacts :: m (2)
-- | the 'Source' of @Context@
, getSource :: Maybe Source (3)
}
| 1 | a TagSet; |
| 2 | Fact s about the event whose shape and possible values
are determined by the Model type m; |
| 3 | optionally, information on the source of the event data. |
Tags
A TagSet is a collection labels ("tags") that summarize the information in the data Model, as given by the Fact s of that model attached to a particular event. Each Event has an associated TagSet, which can contain multiple unique tags.
A Fact is a collection of metadata about an Event, which serves to classify for the purpose of building study cohorts and analyses. A Model for a given study defines which particular Fact s are allowed to bee used in that study.
Fact s can contain a lot of information. For example, a Fact to denote a medical insurance claim could contain procedure codes, cost information and provider information.
The tags in a TagSet synthesize the Fact s associated with an Event to provide a kind of short-hand, useful when programming for or reasoning about the groups of Event s relevant a study.
For example, "diabetes_treatment" and "in_hospital" are possible tags that could be used to summarize a Fact with medical claim information. A tag might synthesize information across multiple Fact s in a single Model by, say, combining claims information with an ICD9 code to label an Event as "diabetes_diagnosis".
At present, tags are created by manually specifying how the Fact s of a Model should be summarized, typically using the helper functions in the notionate package in a project-specific concepts.dhall file.
Facts and Model
See the event-data-model documentation for a definition of Facts and Model. Note that Fact and Model defined there are types defined in the Dhall programming language. The programmer using asclepias must manually define in Haskell code a type m within the Context c m that corresponds to the Model for a given project.
Source
The Source type stores information on the origins of the data.
This allows for increased traceability of the data from source to final analysis.
A Source type has the following definition:
---
data Source = MkSource
{ column :: Maybe T.Text
, file :: Maybe T.Text
, row :: Maybe Integer
, table :: T.Text
, database :: T.Text
}
deriving (Eq, Show, Generic)
---
column is the column name of the source data.
file is the path of the source data.
row is the row number of the source data.
table is the name of the source data table.
database refers to the name of the data source.
For example "Optum" or "Medicaid"
Example
Here is an example of an Event:
data SillySchema =
A Int
| B Text
| C
| D
deriving (Show, Eq, Generic, Data)
instance FromJSON SillySchema
type SillyEvent1 a = Event Text SillySchema a
The SillyEvent type is a project-specific synonym for an Event where
the TagSet is Text.
SillySchema is the Model, a Haskell sum type, with a different Fact given by each of its possible value types
and the Interval type is any valid type a, such as Int.
This structure provides a high degree of flexibility in defining new structures for study-specific cohort definitions and analysis.
See the
event-data-model
documentation for details.
Design of the Features module
A Feature is a type parametrized by two types, name and d:
newtype (KnownSymbol name) => Feature name d =
MkFeature ( FeatureData d )
The type d here stands for "data",
which then parametrizes the FeatureData type.
The FeatureData type is wrapper around an
Either:
newtype FeatureData d = MkFeatureData {
getFeatureData :: Either FeatureProblemFlag d -- ^ Unwrap FeatureData.
}
Type of d can be almost anything
and need not be a scalar.
All the following are valid types for d:
-
Int -
Text -
(Int, Maybe Text) -
[Double]
The name type is a bit special:
it does not appear on the right-hand side of the =.
In type-theory parlance,
name is a
phantom type.
So, a Feature type constructor takes two arguments (name and d),
but its value constructor (MkFeature)
takes a single value of type FeatureData d.
Values of the FeatureData type contain
the data we’re ultimately interested in analyzing
or passing along to downstream applications.
However,
a FeatureData value does not simply contain data of type d.
The type allows for the possibility of
missingness, failures, or errors
via the
Either type.
A content of a FeatureData d, then, is either
a Left FeatureProblemFlag or a
Right d.
The use of Either has important implications when defining features,
as we will see.
Now that we know the internals of a Feature,
how do we create them?
There are two ways to create features:
-
a
purelifting of data into aFeatureor -
writing a
Definition: a function that defines aFeaturebased on otherFeatures.
The first method is a way to get data directly into a Feature.
The following function takes a list of Events and
makes a Feature of them:
allEvents :: [Event Day] -> Feature "allEvents" [Event Day]
allEvents = pure
The pure lifting is generally used to lift a subject’s input data into a Feature,
so that other features can be defined from a subject’s data.
Feature`s are
derived from other `Feature by the Definition type
Specifically,
Definition is a type containing a function that maps Feature inputs
to a Feature output. define (or defineA) constructs the Definition.
For example:
myDef :: Definition (Feature "a" Int -> Feature "b" Bool)
myDef = define (\x -> if x > 0 then True else False)
x is type Int not Feature "a" Int and the return type
is Bool not Feature "b" Bool.
The define function and Definition type
do the magic of lifting these types to the Feature level.
To see this more clearly,
see myDef2 below:
intToBool :: Int -> Bool
intToBool x = if x > 0 then True else False)
myDef2 :: Definition (Feature "a" Int -> Feature "b" Bool)
myDef2 = define intToBoo
myDef2 is equivalent to myDef.
The define function, then,
let’s us focus on the logic of our Feature
without needing to worry about handling the error cases.
If we were to write a function
with signature Feature "a" Int → Feature "b" Bool directly,
it would look something like:
myFeat :: Feature "a" Int -> Feature "b" Bool
myFeat (MkFeature (MkFeatureData (Left r))) = MkFeature (MkFeatureData (Left r))
myFeat (MkFeature (MkFeatureData (Right x))) = MkFeature (MkFeatureData (Right $ intToBool x))
One would need to pattern match all the possible types of inputs, which gets more complicated as the number of inputs increases.
As an aside,
since Feature are
Functors,
one could instead write:
myFeat :: Feature "a" Int -> Feature "b" Bool
myFeat = fmap intToBool
This would require understanding how Functors and similar structures are used.
The define and defineA functions provide a common interface
to these structures without needing to understand the details.
Evaluating Definitions
To evaluate a Definition, we use the eval function.
Consider the following example.
The input data is a list of Int`s. If the list is empty (`null),
this is considered an error in feat1.
If the list has more than 3 elements, then in feat2,
the sum is computed; otherwise 0 is returned.
featInts :: [Int] -> Feature "someInts" [Int]
featInts = pure
feat1 :: Definition (Feature "someInts" [Int] -> Feature "hasMoreThan3" Bool)
feat1 = defineA
(\ints -> if null ints then makeFeature (missingBecause $ CustomFlag "no data")
else makeFeature $ featureDataR (length ints > 3))
feat2 :: Definition (
Feature "hasMoreThan3" Bool
-> Feature "someInts" [Int]
-> Feature "sum" Int)
feat2 = define (\b ints -> if b then sum ints else 0)
ex0 = featInts []
ex0a = eval feat1 ex0 -- MkFeature (MkFeatureData (Left (CustomFlag "no data")))
ex0b = eval feat2 (ex0a, ex0) -- MkFeature (MkFeatureData (Left (CustomFlag "no data")))
ex1 = featInts [3, 8]
ex1a = eval feat1 ex1 -- MkFeature (MkFeatureData (Right False))
ex1b = eval feat2 (ex1a, ex1) -- MkFeature (MkFeatureData (Right 0))
ex2 = featInts [1..4]
ex2a = eval feat1 ex2 -- MkFeature (MkFeatureData (Right True))
ex2b = eval feat2 (ex2a, ex2) -- MkFeature (MkFeatureData (Right 10))
Note the value of ex0b.
It is a Left because the value of ex0a is a Left;
in other words, errors propogate along Feature.
If a given Feature dependency is a Left then
that Feature will also be Left.
This propogation of Left s often increases performance.
In a sequence of computations,
after the first Left occurs,
no subsequent computations are needed.
Type Safety of Features
In describing the Feature type,
the utility of having the name as a type may not have been clear.
To clarify, consider the following example:
x :: Feature "someInt" Natural
x = pure 39
y :: Feature "age" Natural
y = pure 43
f :: Definition (Feature "age" Natural -> Feature "isOld" Bool)
f = define (>= 39)
fail = eval f x
pass = eval f y
In the example, fail does not compile because "someInt" is not "age",
even though both the data types are Natural.
If the fail line is uncommented, you will see a compiler error like this:
* Couldn't match type `"someInt"' with `"age"'
Expected type: Feature "age" Natural
Actual type: Feature "someInt" Natural
* In the second argument of `eval', namely `x'
In the expression: eval f x
In an equation for `fail': fail = eval f x
Commenting out fail will produce a pass value that looks like this when
printed,
"isOld": MkFeatureData {getFeatureData = Right True}
Make note of what the compiler error is telling you: The string "age" in
the type signature of f is actually part of the type of the input. The
compiler says Feature "age" Natural is not the same type as Feature
"someInt" Natural and therefore does not compile the program.
In other words, if you correctly declare that the input type of f must be a
Feature "age" Natural then you can only give f some input that has that
type, including the name "age".
Therefore, the name field of a Feature can serve as a tool to declare your
intention about which inputs should be paired with which functions via their
names.
You can easily abuse this aspect of Feature.
For example, this safety mechanism breaks down when you allow the compiler to choose any name it wants by setting the name to be a type variable rather than an explicit string.
The following version compiles just fine, with x, y as above.
f :: Definition (Feature n Natural -> Feature "isOld" Bool)
f = define (>= 39)
gotcha = eval f x
The only difference is that Feature n Natural has a type variable n where
"age" used to be. gotcha when printed produces
"isOld": MkFeatureData {getFeatureData = Right True}
If you do not wish to be intentional about using names as a means
to specify named inputs and outputs via the type system, it is likely not
worth using Feature at all.