Cohort Collector
The cohort-collector application provides a utility for combining cohorts built from the same cohort specification.
A common use case is combining cohort data created on different partitions of event data.
The application takes a single text file containing one filepath per line as input.
The file can be located either on the local file system or on S3.
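As a minimal sketch of preparing that input, the following Python helper lists the cohort files in a local directory and writes them to a manifest, one path per line. The function name write_manifest, its parameters, and the default "*.json" pattern are illustrative assumptions and are not part of cohort-collector.

```python
from pathlib import Path


def write_manifest(cohort_dir: str, manifest_path: str, pattern: str = "*.json") -> list[str]:
    """Write one cohort file path per line and return the paths written.

    Hypothetical helper: the name, arguments, and default pattern are
    assumptions made for this example only.
    """
    # Sort the matches so the manifest is reproducible across runs.
    paths = sorted(str(p) for p in Path(cohort_dir).glob(pattern) if p.is_file())
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {cohort_dir}")
    # One line = one file.
    Path(manifest_path).write_text("\n".join(paths) + "\n", encoding="utf-8")
    return paths
```

The paths are written relative to cohort_dir as it was passed in, so the application should be run from a directory where those paths resolve.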
The latest Linux executable can be downloaded from:
If you have access to our container registry, a minimal Docker image containing the application is also available:
docker pull registry.novisci.com/nsstat/asclepias/cohort-collector:latest
Usage
cohort collector
Usage: cohort-collector ([-d|--dir DIRECTORY] (-f|--file INPUT) |
(-b|--bucket Bucket) (-m|--manifest KEY))
[[--outdir DIRECTORY] (-o|--output OUTPUT) |
[--outregion OUTREGION] --outbucket OUTBUCKET
--outkey OUTPUTKEY] [-d|--decompress] [-z|--gzip]
Collects cohorts run on different input data. The cohorts must be derived from
the same cohort specification or results may be weird. Supports reading data
from a local directory or from S3. In either case the input is a path to a
file containing paths (or S3 keys) to each cohort part, where One line = one
file. S3 capabilities are currently limited (e.g. AWS region is set to N.
Virginia).Data can be output to stdout (default), to a file (using the -o
option), or to S3 (using the --outbucket and --outkey options).
Available options:
-d,--dir DIRECTORY optional directory
-f,--file INPUT Input file
-b,--bucket Bucket S3 bucket
-m,--manifest KEY S3 manifest file
--outdir DIRECTORY optional output directory
-o,--output OUTPUT Output file
--outregion OUTREGION output AWS Region
--outbucket OUTBUCKET output S3 bucket
--outkey OUTPUTKEY output S3 location
-d,--decompress decompress gzipped input
-z,--gzip compress output using gzip
-h,--help Show this help text
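The usage line above groups the options into alternatives: input comes either from a local manifest (--file, optionally with --dir) or from S3 (--bucket with --manifest), and output goes to stdout, to a file (--output), or to S3 (--outbucket with --outkey). The sketch below assembles an argument list that respects those groupings. The function build_command and its parameter names are hypothetical. Only flags that appear in the help text are used, and the long forms are chosen because the help text lists -d for both --dir and --decompress.

```python
from typing import Optional


def build_command(
    manifest: str,
    *,
    input_dir: Optional[str] = None,
    bucket: Optional[str] = None,
    output: Optional[str] = None,
    out_bucket: Optional[str] = None,
    out_key: Optional[str] = None,
    decompress: bool = False,
    gzip_output: bool = False,
) -> list[str]:
    """Build a cohort-collector argument list (hypothetical helper)."""
    cmd = ["cohort-collector"]

    # Input: S3 (--bucket and --manifest) or local (--file, optional --dir).
    if bucket is not None:
        if input_dir is not None:
            raise ValueError("input_dir applies to local input only")
        cmd += ["--bucket", bucket, "--manifest", manifest]
    else:
        if input_dir is not None:
            cmd += ["--dir", input_dir]
        cmd += ["--file", manifest]

    # Output: S3 needs both bucket and key; otherwise a file or stdout.
    if (out_bucket is None) != (out_key is None):
        raise ValueError("out_bucket and out_key must be given together")
    if out_bucket is not None:
        if output is not None:
            raise ValueError("choose either a file output or an S3 output")
        cmd += ["--outbucket", out_bucket, "--outkey", out_key]
    elif output is not None:
        cmd += ["--output", output]

    if decompress:
        cmd.append("--decompress")
    if gzip_output:
        cmd.append("--gzip")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where the cohort-collector executable is on the PATH. The file, bucket, and key names supplied by the caller are placeholders of the caller's choosing.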
Example usage
Suppose we have three cohort JSON files we want to combine.
testcw1.json
{"example":[{"attritionInfo":[[{"tag":"SubjectHasNoIndex"},0],[{"contents":[1,"dummy"],"tag":"ExcludedBy"},0],[{"tag":"Included"},2]],"totalSubjectsProcessed":2,"totalUnitsProcessed":2},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5],[true,true]],"ids":[["a",["2010-07-06","2010-07-07"]],["b",["2010-07-06","2010-07-07"]]]},"tag":"CW"}]}
testcw2.json
{"example":[{"attritionInfo":[[{"tag":"SubjectHasNoIndex"},0],[{"contents":[1,"dummy"],"tag":"ExcludedBy"},0],[{"tag":"Included"},2]],"totalSubjectsProcessed":2,"totalUnitsProcessed":2},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5],[true,true]],"ids":[["d",["2010-07-06","2010-07-07"]],["e",["2010-07-06","2010-07-07"]]]},"tag":"CW"}]}
testcw3.json
{"example":[{"attritionInfo":[[{"tag":"SubjectHasNoIndex"},0],[{"contents":[1,"dummy"],"tag":"ExcludedBy"},0],[{"tag":"Included"},2]],"totalSubjectsProcessed":2,"totalUnitsProcessed":2},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5],[true,true]],"ids":[["f",["2010-07-06","2010-07-07"]],["g",["2010-07-06","2010-07-07"]]]},"tag":"CW"}]}
To combine these files, we provide a manifest file:
manifestcw.txt
test/tests/testcw1.json
test/tests/testcw2.json
test/tests/testcw3.json
Then run the cohort-collector app from the directory that contains the manifest file:
$ cohort-collector -f manifestcw.txt
{"example":[{"attritionInfo":[{"attritionCount":2,"attritionLevel":{"contents":[1,"dummy"],"tag":"ExcludedBy"}},{"attritionCount":5,"attritionLevel":{"tag":"Included"}}],"totalProcessed":7},{"contents":{"attributes":[{"attrs":{"getDerivation":"","getLongLabel":"another label","getPurpose":{"getRole":["Outcome"],"getTags":[]},"getShortLabel":"somelabel"},"name":"myVar1","type":"Count"},{"attrs":{"getDerivation":"","getLongLabel":"","getPurpose":{"getRole":[],"getTags":[]},"getShortLabel":""},"name":"myVar2","type":"Bool"}],"cohortData":[[5,5,10,99,86],[true,true,false,true,true]],"ids":["a","b","c","d","f"]},"tag":"CW"}]}
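To make the idea of combining concrete, here is a rough Python sketch of merging cohort parts that have the shape of the testcw input files shown earlier: attrition counts and processed totals are summed, ids are concatenated, and each cohortData column is extended. This is an illustration under assumptions, not the application's implementation. The function names are hypothetical, the merge rules are inferred from the example files, and the JSON written by cohort-collector may be laid out differently from what this sketch returns.

```python
import copy
import json


def _merge_attrition(acc: list, new: list) -> list:
    """Sum [level, count] pairs, matching levels by their JSON encoding."""
    counts = {json.dumps(level, sort_keys=True): [level, n] for level, n in acc}
    for level, n in new:
        key = json.dumps(level, sort_keys=True)
        if key in counts:
            counts[key][1] += n
        else:
            counts[key] = [level, n]
    return list(counts.values())


def combine_cohorts(parts: list[dict]) -> dict:
    """Combine parsed cohort files built from one cohort specification."""
    combined: dict = {}
    for part in parts:
        for name, (info, cohort) in part.items():
            if name not in combined:
                # Copy the first part so the caller's data is not mutated.
                combined[name] = copy.deepcopy([info, cohort])
                continue
            acc_info, acc_cohort = combined[name]
            acc, new = acc_cohort["contents"], cohort["contents"]
            # Parts from the same specification should describe the same variables.
            if acc["attributes"] != new["attributes"]:
                raise ValueError(f"cohort {name!r}: attributes differ between parts")
            acc_info["attritionInfo"] = _merge_attrition(
                acc_info["attritionInfo"], info["attritionInfo"]
            )
            for key in ("totalSubjectsProcessed", "totalUnitsProcessed"):
                acc_info[key] += info[key]
            acc["ids"].extend(new["ids"])
            # cohortData holds one list per variable; extend each column.
            for column, extra in zip(acc["cohortData"], new["cohortData"]):
                column.extend(extra)
    return combined


def combine_from_manifest(manifest_path: str) -> dict:
    """Read a one-path-per-line manifest and combine the listed files."""
    with open(manifest_path, encoding="utf-8") as fh:
        paths = [line.strip() for line in fh if line.strip()]
    parts = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            parts.append(json.load(fh))
    return combine_cohorts(parts)
```

The attributes check reflects the requirement that all parts come from the same cohort specification; the sketch raises an error on a mismatch rather than guessing how to reconcile different variables.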