Event Data Theory

Table of Contents

What is an event
- Event Contexts
Marshalling event data
Defining new models
- Typeclasses for component types
- Testing models

The event-data-theory package of asclepias provides the types and functions for defining models of event data. The terms "theory" and "model" are borrowed from the notion of a Lawvere theory. ^[1]

Having a basic understanding of how to read Haskell’s types will be useful when reading this document. Online resources such as Real World Haskell are excellent for learning.

What is an event

Abstractly, we define an event as an object that contains information about when something happened and what happened. Concretely, an Event is a wrapper around the interval-algebra package’s PairedInterval type:

newtype Event c m a = MkEvent ( PairedInterval (Context c m) a )

The what part of the pair is a Context c m, which is described further below. The when part is an Interval a, which is described in the interval-algebra documentation. Since an Event is an instance of the Intervallic typeclass, almost anything you can do with Interval types you can also do with Event types.
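
To make the wrapper structure concrete, here is a minimal sketch built from simplified stand-ins. The ToyInterval, ToyContext, and ToyEvent types and the helper functions are hypothetical and exist only for illustration. They are not the interval-algebra or event-data-theory definitions, which carry more structure (such as the Intervallic typeclass).

```haskell
-- Hypothetical stand-ins for illustration only; not the library's types.

-- | The "when": a begin/end pair (no validation is done in this sketch).
data ToyInterval a = ToyInterval { begin :: a, end :: a }
  deriving (Show, Eq)

-- | The "what": concepts (tags) plus facts, loosely mirroring Context c m.
data ToyContext c m = ToyContext
  { toyConcepts :: [c]
  , toyFacts    :: m
  } deriving (Show, Eq)

-- | An event pairs a context with an interval, loosely mirroring Event c m a.
newtype ToyEvent c m a = ToyEvent (ToyContext c m, ToyInterval a)
  deriving (Show, Eq)

-- | Access the "when" part of an event.
eventInterval :: ToyEvent c m a -> ToyInterval a
eventInterval (ToyEvent (_, i)) = i

-- | Access the "what" part of an event.
eventContext :: ToyEvent c m a -> ToyContext c m
eventContext (ToyEvent (c, _)) = c

-- | Does the first event end before the second begins?
endsBefore :: Ord a => ToyEvent c m a -> ToyEvent c m a -> Bool
endsBefore x y = end (eventInterval x) < begin (eventInterval y)

-- Example values: a hospital stay on days 1 to 5 and a visit on days 10 to 11.
stay, followUp :: ToyEvent String String Int
stay     = ToyEvent (ToyContext ["InHospital"] "inpatient", ToyInterval 1 5)
followUp = ToyEvent (ToyContext ["Visit"] "outpatient", ToyInterval 10 11)
```

In this sketch, endsBefore stay followUp compares only the interval parts, which is the sense in which interval operations carry over to events.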

Event Contexts

An event’s Context contains three types of information:

data Context c m = MkContext
  { -- | the 'Concepts' of a @Context@
    getConcepts :: Concepts c -- (1)
    -- | the facts of a @Context@
  , getFacts    :: m -- (2)
    -- | the 'Source' of @Context@
  , getSource   :: Maybe Source -- (3)
  }

1	a set of `Concepts`, or tags, which can be used to identify events in a collection;
2	facts about the event whose shape and possible values are determined by the schema type `m`;
3	(optionally) data about the provenance of the event in a `Source` object.

Making the type parameters c and m concrete is what creates a new event model. The Concepts type c will generally be an enumerated set of tag variants (such as data MyProjectTags = Diabetes | BirthDay | InHospital | ...) or simply Text. Enumerated tags are preferred over Text because they give users some type safety around concepts. One cannot misspell a concept or use an undefined concept, for example.
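
The type-safety argument can be sketched as follows. The tag variants come from the example above; the hasTag helper and the example values are hypothetical, and String stands in for Text so that the sketch depends only on base.

```haskell
-- Hypothetical illustration; hasTag and the example values are assumptions.
data MyProjectTags = Diabetes | BirthDay | InHospital
  deriving (Show, Eq, Ord, Enum, Bounded)

-- | Check whether a list of concepts contains a given tag.
hasTag :: Eq c => c -> [c] -> Bool
hasTag = elem

-- With an enumerated type, a misspelling such as `Diabtes` would be rejected
-- by the compiler because no such constructor exists.
typedCheck :: Bool
typedCheck = hasTag Diabetes [Diabetes, InHospital]

-- With plain strings, the same misspelling compiles and simply never matches.
stringCheck :: Bool
stringCheck = hasTag "Diabtes" ["Diabetes", "InHospital"]

-- | Every defined tag, available because the type is Enum and Bounded.
allTags :: [MyProjectTags]
allTags = [minBound .. maxBound]
```

Here typedCheck is True, while stringCheck is False even though the intended concept is present in the list.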

The type parameter m provides a high degree of flexibility in defining new event models. The m type represents the schema, or shape, of an event’s data and can be a nearly arbitrary type composed of sum and product types. Often, the m type will be a sum type of "domains", where each domain groups the facts relevant to one subject area. The schema of NoviSci’s standard EDM is organized around this idea.
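
A sum type of domains might look like the following sketch. Every name here (ToySchema, DiagnosisFacts, EnrollmentFacts, and their fields) is hypothetical and chosen for illustration; this is not the schema of NoviSci’s standard EDM.

```haskell
-- Hypothetical domains for illustration; not the standard EDM schema.

-- | Facts for a diagnosis domain (a product type).
data DiagnosisFacts = DiagnosisFacts
  { code    :: String
  , primary :: Bool
  } deriving (Show, Eq)

-- | Facts for an enrollment domain (a product type).
data EnrollmentFacts = EnrollmentFacts
  { plan :: String
  } deriving (Show, Eq)

-- | The schema: one constructor per domain (a sum type).
data ToySchema
  = Diagnosis DiagnosisFacts
  | Enrollment EnrollmentFacts
  | Death -- a domain that carries no facts
  deriving (Show, Eq)

-- | Extract the diagnosis code when the facts belong to the diagnosis domain.
diagnosisCode :: ToySchema -> Maybe String
diagnosisCode (Diagnosis d) = Just (code d)
diagnosisCode _             = Nothing
```

Pattern matching on the domain constructor is how downstream code would select the facts it cares about and ignore the rest.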

Marshalling event data

Events are generally produced by some process outside of asclepias that extracts and transforms a data source into a sequence of events. NoviSci’s standard EDM represents event data (plus additional information sometimes used in other applications) as a JSON array, where each line in a file is a valid EventLine. The event-data-theory EventLine type corresponds to this JSON array and is the primary way of marshalling data into an Event.

The EventDataTheory.EventLines module provides several utilities for decoding events from event lines. The parseEventLinesL function, for example, converts a ByteString of newline-delimited JSON into a pair of [String] (containing any parse error messages) and [(SubjectID, Event c m a)], a list of subject ID/event pairs.
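
The shape of that result (errors in one list, successes in the other) can be illustrated with a toy line parser. This sketch is not parseEventLinesL and does not parse JSON: parseToyLine, parseToyLines, and the "subject day" line format are hypothetical stand-ins that use only base.

```haskell
import Data.Either (partitionEithers)
import Text.Read (readMaybe)

-- Hypothetical toy format for illustration: each line is "<subject> <day>".

-- | Parse one line into a (subject, day) pair or an error message.
parseToyLine :: String -> Either String (String, Int)
parseToyLine l = case words l of
  [subj, day] -> case readMaybe day of
    Just d  -> Right (subj, d)
    Nothing -> Left ("bad day in line: " ++ l)
  _ -> Left ("wrong number of fields in line: " ++ l)

-- | Parse every line, collecting error messages and successes separately,
-- so that one malformed line does not discard the rest of the input.
parseToyLines :: String -> ([String], [(String, Int)])
parseToyLines = partitionEithers . map parseToyLine . lines

-- | Three input lines, the second of which is malformed.
toyResult :: ([String], [(String, Int)])
toyResult = parseToyLines "a 1\nb x\nc 3"
```

Returning both lists lets the caller decide whether any parse error should be fatal or merely logged.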

Defining new models

New event models are defined by providing concrete types for Event c m a (especially c and m), as in this example from the package’s test suite:

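{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson (FromJSON (..), Options (..), SumEncoding (..),
                   defaultOptions, genericParseJSON)
import Data.Data (Data)
import Data.Text (Text)
-- Assumption: Event is exported by the package's top-level module.
import EventDataTheory (Event)
import GHC.Generics (Generic)

data SillySchema =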
    A Int
  | B Text
  | C
  | D
  deriving (Show, Eq, Generic, Data)

instance FromJSON SillySchema where
  parseJSON = genericParseJSON
    (defaultOptions
      { sumEncoding = TaggedObject { tagFieldName      = "domain"
                                   , contentsFieldName = "facts"
                                   }
      }
    )

type SillyEvent1 a = Event Text SillySchema a

The SillyEvent1 type is a synonym for an Event where the concepts are Text, the facts are of shape SillySchema, and the Interval type is any valid type a.

Typeclasses for component types

The schema (m) type for an Event must be an instance of the Eq, Show, Generic, and FromJSON typeclasses. The DeriveGeneric language extension makes deriving the Generic instance trivial, as in the code above. At this time, users need to provide the FromJSON instance themselves; the boilerplate in the example above should work in most cases.

The concept (c) type for an Event must be an instance of the Eq, Show, Typeable, and FromJSON typeclasses. GHC provides the Typeable instance automatically, so in most cases simply deriving (Eq, Show, Generic) and a stock FromJSON instance is sufficient for the concept type.
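
As a small check of the deriving clause, the sketch below defines a hypothetical concept type and passes it to a function that demands Eq, Show, and Typeable. StudyConcept and describeConcept are illustrative names only. The FromJSON instance is left out so that the sketch depends only on base; a real concept type would still need one.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Typeable (Typeable, typeOf)
import GHC.Generics (Generic)

-- Hypothetical concept type for illustration.
-- No explicit Typeable deriving clause is written here.
data StudyConcept = Exposure | Outcome
  deriving (Eq, Show, Generic)

-- | Render a concept with its type name. The constraints mirror the
-- (non-JSON) requirements on a concept type described above.
describeConcept :: (Eq c, Show c, Typeable c) => c -> String
describeConcept x = show x ++ " :: " ++ show (typeOf x)

-- | Compiles only if StudyConcept satisfies all three constraints.
described :: String
described = describeConcept Exposure
```

If the type lacked one of the required instances, the definition of described would fail to compile, which makes this a cheap way to confirm a candidate concept type before wiring it into an Event.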

Testing models

The event-data-theory package provides a few utilities for testing a new model. These can be found in the EventDataTheory.Test module, which is not included in the main set of exported modules.

The eventDecodeTests and eventDecodeFailTests functions, for example, test for successful parsing and expected parse failures (respectively) of an EventLine c m a into the corresponding Event c m a. These functions take a directory path as an argument. Each file ending in .jsonl in that directory should contain a single EventLine as JSON to be tested. See the test directory and the EventDataTheory.TheoryTest module in this package for examples.

1. We use the terms informally to give the sense that a model of events is an instance of the theory. We have not checked that the event data theory actually is a universal algebra.