Cohort Collector

The cohort-collector application provides a utility for combining cohorts built from the same cohort specification. A common use case is combining cohort data created on different partitions of event data. The application takes as input a single text file (a manifest) that lists one file path, or S3 key, per line. The manifest can be located either on the local file system or on S3.
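
Because the input is just a list of paths, it can be generated with standard shell tools. The following is a minimal sketch rather than part of the application: the directory name cohorts, the file pattern part*.json, and the manifest name manifest.txt are hypothetical, and the empty placeholder files only stand in for real cohort output.

```shell
#!/bin/sh
# Sketch: build a manifest listing one cohort file per line.
# The scratch directory and empty placeholder files are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p cohorts
for part in 1 2 3; do
  : > "cohorts/part$part.json"   # stand-in for real cohort output
done

# List matching files, sorted for a stable order, one path per line.
find cohorts -type f -name 'part*.json' | sort > manifest.txt
cat manifest.txt
```

The paths are written relative to the directory where find is run, so the manifest should be used from that same directory.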

The latest Linux executable can be downloaded from:

https://downloads.novisci.com/hasklepias/cohort-collector-0.20.3-x86_64-linux.tar.gz

If you have access to our container registry, a minimal Docker image containing the application is also available:

docker pull registry.novisci.com/nsstat/asclepias/cohort-collector:latest

Usage

cohort collector

Usage: cohort-collector ([-d|--dir DIRECTORY] (-f|--file INPUT) |
                          (-b|--bucket Bucket) (-m|--manifest KEY))
                        [[--outdir DIRECTORY] (-o|--output OUTPUT) |
                          [--outregion OUTREGION] --outbucket OUTBUCKET
                          --outkey OUTPUTKEY] [-d|--decompress] [-z|--gzip]

  Collects cohorts run on different input data. The cohorts must be derived from
  the same cohort specification or results may be weird. Supports reading data
  from a local directory or from S3. In either case the input is a path to a
  file containing paths (or S3 keys) to each cohort part, where One line = one
  file. S3 capabilities are currently limited (e.g. AWS region is set to N.
  Virginia).Data can be output to stdout (default), to a file (using the -o
  option), or to S3 (using the --outbucket and --outkey options).

Available options:
  -d,--dir DIRECTORY       optional directory
  -f,--file INPUT          Input file
  -b,--bucket Bucket       S3 bucket
  -m,--manifest KEY        S3 manifest file
  --outdir DIRECTORY       optional output directory
  -o,--output OUTPUT       Output file
  --outregion OUTREGION    output AWS Region
  --outbucket OUTBUCKET    output S3 bucket
  --outkey OUTPUTKEY       output S3 location
  -d,--decompress          decompress gzipped input
  -z,--gzip                compress output using gzip
  -h,--help                Show this help text
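
The options above can be combined in several ways. The sketch below assembles a few invocations using only the flags listed in the help text; it has not been verified against a particular release. The file, directory, and bucket names are hypothetical placeholders, and the run helper is an illustrative dry-run wrapper, not part of cohort-collector. Because the help text lists -d for both --dir and --decompress, the sketch uses the long option names throughout.

```shell
#!/bin/sh
# Sketch: assemble cohort-collector invocations from the documented options.
# All file, directory, and bucket names are hypothetical placeholders.

# Print each command; execute it only when DRY_RUN is set to 0.
DRY_RUN=${DRY_RUN:-1}
run() {
  echo "+ $*"
  if [ "$DRY_RUN" -eq 0 ]; then "$@"; fi
}

# Local manifest, combined cohort written to a local file.
run cohort-collector --file manifestcw.txt --output combined.json

# Local manifest of gzipped parts, gzipped output.
run cohort-collector --file manifestcw.txt --decompress --gzip \
  --output combined.json.gz

# Manifest read from S3, combined cohort written back to S3.
run cohort-collector --bucket my-input-bucket --manifest cohorts/manifest.txt \
  --outbucket my-output-bucket --outkey cohorts/combined.json
```

With the default DRY_RUN=1 the script only prints the commands, which makes it easy to review them before running with DRY_RUN=0.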

Example usage

Suppose we have three cohort JSON files that we want to combine.

testcw1.json

{"example":[{"attritionInfo":[[{"tag":"SubjectHasNoIndex"},0],[{"contents":[1,"dummy"],"tag":"ExcludedBy"},0],[{"tag":"Included"},2]],"totalSubjectsProcessed":2,"totalUnitsProcessed":2},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5],[true,true]],"ids":[["a",["2010-07-06","2010-07-07"]],["b",["2010-07-06","2010-07-07"]]]},"tag":"CW"}]}

testcw2.json

{"example":[{"attritionInfo":[[{"tag":"SubjectHasNoIndex"},0],[{"contents":[1,"dummy"],"tag":"ExcludedBy"},0],[{"tag":"Included"},2]],"totalSubjectsProcessed":2,"totalUnitsProcessed":2},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5],[true,true]],"ids":[["d",["2010-07-06","2010-07-07"]],["e",["2010-07-06","2010-07-07"]]]},"tag":"CW"}]}

testcw3.json

{"example":[{"attritionInfo":[[{"tag":"SubjectHasNoIndex"},0],[{"contents":[1,"dummy"],"tag":"ExcludedBy"},0],[{"tag":"Included"},2]],"totalSubjectsProcessed":2,"totalUnitsProcessed":2},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5],[true,true]],"ids":[["f",["2010-07-06","2010-07-07"]],["g",["2010-07-06","2010-07-07"]]]},"tag":"CW"}]}

To combine these files, we provide a manifest file:

manifestcw.txt

test/tests/testcw1.json
test/tests/testcw2.json
test/tests/testcw3.json
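
Since the manifest paths are resolved on the local file system in this example, it can help to confirm that every listed file exists before collecting. The following is a minimal sketch under that assumption: check_manifest is a hypothetical helper, not part of cohort-collector, and the scratch directory deliberately omits the third file to show what a failure looks like. It does not apply to manifests of S3 keys.

```shell
#!/bin/sh
# Sketch: check that every path listed in a local manifest exists.
# The scratch directory and empty placeholder files are illustrative only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p test/tests
: > test/tests/testcw1.json
: > test/tests/testcw2.json   # testcw3.json is left out on purpose
printf '%s\n' test/tests/testcw1.json test/tests/testcw2.json \
  test/tests/testcw3.json > manifestcw.txt

# Print each missing path; return non-zero if any path is missing.
check_manifest() {
  manifest=$1
  missing=0
  while IFS= read -r path || [ -n "$path" ]; do
    if [ -z "$path" ]; then continue; fi   # skip blank lines
    if [ ! -f "$path" ]; then
      echo "missing: $path"
      missing=$((missing + 1))
    fi
  done < "$manifest"
  [ "$missing" -eq 0 ]
}

check_manifest manifestcw.txt || echo "manifest has missing entries"
```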

Then run cohort-collector from the directory that contains the manifest file. By default, the combined cohort is written to stdout:

$ cohort-collector -f manifestcw.txt
{"example":[{"attritionInfo":[{"attritionCount":2,"attritionLevel":{"contents":[1,"dummy"],"tag":"ExcludedBy"}},{"attritionCount":5,"attritionLevel":{"tag":"Included"}}],"totalProcessed":7},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5,10,99,86],[true,true,false,true,true]],"ids":["a","b","c","d","f"]},"tag":"CW"}]}