Report Data Format

Dynamic Reports on the NoviSci Web Platform are built using a web-based creation tool. They are composed of pages with data visualizations that support some level of data exploration. Viewers of the report can use a “control bar” to select stratification and filter options. The data imported into the platform must be structured to support these stratification and filter options.

Data Sets

A data set for dynamic reports is a small-ish rectangular block of data that is used to render one or more visualizations on a report page.

At a high level, the columns of this data set represent either subgroups (e.g., race, age group, sex) or values to be plotted or displayed (e.g., count, mean, percent).

The rectangular block of data contains one row of values for every combination of the values of all of the subgroups.
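
As a rough illustration of this requirement, the sketch below checks whether a data frame contains every combination of its subgroup values. The function name `is_complete_grid` and the example columns are hypothetical and not part of the platform; `NA` is counted as a subgroup value in its own right.

```r
# Hypothetical helper (not a platform function): TRUE when every combination
# of the values in `subgroup_cols` appears at least once in `data`.
is_complete_grid <- function(data, subgroup_cols) {
  # Number of distinct values per subgroup column; NA counts as a value
  n_levels <- vapply(
    data[subgroup_cols],
    function(x) length(unique(x)),
    integer(1)
  )
  # Number of distinct combinations actually present in the data
  n_present <- nrow(unique(data[subgroup_cols]))
  n_present == prod(n_levels)
}

# A small complete grid: 3 region values (including NA) times 2 years
example_grid <- expand.grid(
  region = c(NA, "East", "West"),
  year = 2014:2015,
  stringsAsFactors = FALSE
)

is_complete_grid(example_grid, c("region", "year"))
is_complete_grid(example_grid[-1, ], c("region", "year"))  # one row dropped
```

The first call should report a complete grid and the second an incomplete one, because one combination has been removed.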

Simple data set

Say you have a risk value for each year between 2014 and 2018.

risk_data = read.csv('./report_data_format_risk.csv')
risk_data
#>         year      risk
#> 1 2014-01-01 0.1218669
#> 2 2015-01-01 0.1081349
#> 3 2016-01-01 0.1704000
#> 4 2017-01-01 0.1162708
#> 5 2018-01-01 0.1099495

This can be plotted on a line graph with year on the x-axis and risk on the y-axis.

Adding region

Now I want to see the same data, but stratified by region.

I would need to add a column for region and rows for every combination of region and year (including rows where region is NA, which hold the overall values from the simple data set). We started with 5 rows; with 4 regions plus NA, each crossed with the 5 years, we now have 25 rows.

risk_data = read.csv('./report_data_format_risk_region.csv')
risk_data
#>    region year       risk
#> 1    <NA> 2014 0.12186692
#> 2    <NA> 2015 0.10813493
#> 3    <NA> 2016 0.17040004
#> 4    <NA> 2017 0.11627084
#> 5    <NA> 2018 0.10994955
#> 6    East 2014 0.10930921
#> 7    East 2015 0.16191591
#> 8    East 2016 0.10451461
#> 9    East 2017 0.11816492
#> 10   East 2018 0.11627084
#> 11  North 2014 0.09420831
#> 12  North 2015 0.14612402
#> 13  North 2016 0.10394989
#> 14  North 2017 0.10994955
#> 15  North 2018 0.12186692
#> 16  South 2014 0.10098599
#> 17  South 2015 0.12787712
#> 18  South 2016 0.09283255
#> 19  South 2017 0.10348196
#> 20  South 2018 0.10813493
#> 21   West 2014 0.14448399
#> 22   West 2015 0.20498579
#> 23   West 2016 0.16926869
#> 24   West 2017 0.17071404
#> 25   West 2018 0.17040004

You can see how the same would apply when adding another subgroup such as sex. Adding 2 sex values plus NA triples the row count, so we now have 75 rows.
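
The row counts can be reproduced with a small sketch using `expand.grid()`. The sex codes `"F"` and `"M"` are assumed here for illustration, and only the subgroup columns are built; the risk values themselves would come from the analysis.

```r
years   <- 2014:2018
regions <- c(NA, "East", "North", "South", "West")  # NA = not stratified by region
sexes   <- c(NA, "F", "M")                          # NA = not stratified by sex (codes assumed)

# Every combination of region and year: 5 x 5 rows
by_region <- expand.grid(
  region = regions,
  year = years,
  stringsAsFactors = FALSE
)

# Adding sex multiplies the row count by 3: 3 x 5 x 5 rows
by_region_sex <- expand.grid(
  sex = sexes,
  region = regions,
  year = years,
  stringsAsFactors = FALSE
)

nrow(by_region)      # 25
nrow(by_region_sex)  # 75
```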

Granularity of data sets

You can imagine one large data set for a whole report where we cross all the different outcome variables by all demographic subgroups, etc. For the purposes of importing into the platform we need to break that down into smaller data sets.

How small do we make them? As a rough guide we should aim for one per chart. This makes it easier for someone creating the report and for testing (we update one data set and test the one chart).

That said, if we have two charts, one showing overall values and one showing overall values with the ability to stratify by region, it would be reasonable to have one data set and point both charts at it.
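
To see why one data set can serve both charts, the sketch below splits a shortened copy of the region data (values taken from the example above) into the rows each chart would use. How the platform itself selects rows is not shown here; this only illustrates that both views live in the same block of data.

```r
# Shortened copy of the region example: 2 years, 2 regions plus NA
risk_data <- data.frame(
  region = c(NA, NA, "East", "East", "West", "West"),
  year = rep(2014:2015, times = 3),
  risk = c(0.12186692, 0.10813493, 0.10930921,
           0.16191591, 0.14448399, 0.20498579),
  stringsAsFactors = FALSE
)

# Chart 1: overall values are the rows where region is NA
overall <- risk_data[is.na(risk_data$region), ]

# Chart 2, when stratified by region: the rows with a region value
stratified <- risk_data[!is.na(risk_data$region), ]

overall
stratified
```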

Data values & formats

The goal of setting these formats and list of standard values is to set some consistent, default value mappings while also allowing the person building the report to override label values if a specific label is needed for a report.

Dates

If a value in the data is expected to be used for plotting time series data, or in an interactive slider with quarterly or monthly time points, it should be passed as a string in ISO 8601 format: YYYY-MM-DD for a date with no time, or YYYY-MM-DDTHH:mm:ssZ if a time needs to be specified.

Note: This includes “quarterly” values, specifying the first day of the quarter: 2020-01-01, 2020-04-01, 2020-07-01, 2020-10-01, etc.

Exception: If the value is only at year granularity, it can be passed as a number (for example, 2020) instead of a date string.
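
A minimal sketch of producing these formats in base R follows. The variable names and the example timestamp are illustrative only.

```r
# First day of each quarter in 2020, formatted as YYYY-MM-DD
quarter_starts  <- seq(as.Date("2020-01-01"), by = "quarter", length.out = 4)
quarter_strings <- format(quarter_starts, "%Y-%m-%d")
quarter_strings

# A date-time in UTC, formatted as YYYY-MM-DDTHH:mm:ssZ
timestamp        <- as.POSIXct("2020-04-01 13:30:00", tz = "UTC")
timestamp_string <- format(timestamp, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
timestamp_string

# Year-only granularity: a plain number is enough
year_value <- as.integer(format(as.Date("2020-04-01"), "%Y"))
year_value
```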

Proportion (not Percent)

Values representing a proportion should be sent as a decimal value (0.1, not 10%). Formatting is applied in the UI (multiplying by 100, adding the % sign, and rounding as desired).
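
For example, if an analysis produces percentages, they can be converted before export as sketched below. The `sprintf()` call only imitates the kind of formatting the UI applies; the one-decimal rounding is an assumption made for this illustration.

```r
percent_values <- c(10, 12.5, 100)

# Convert to proportions before writing the data set
proportions <- percent_values / 100
proportions

# Illustration only: roughly what display-side formatting might look like
display_labels <- sprintf("%.1f%%", proportions * 100)
display_labels
```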

Boolean values

Boolean values should be passed as TRUE or FALSE, which should result in TRUE and FALSE values in the resulting CSV file.
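
A quick way to confirm this is to write a logical column with `write.csv()` and read the raw lines back, as in the sketch below. The column name `suppressed` is hypothetical.

```r
flags <- data.frame(
  year = c(2014L, 2015L),
  suppressed = as.logical(c(1, 0))  # 1/0 indicators converted to TRUE/FALSE
)

# Write to a temporary CSV file and inspect the raw text
path <- tempfile(fileext = ".csv")
write.csv(flags, path, row.names = FALSE)
csv_lines <- readLines(path)
csv_lines
```

The data rows should contain the unquoted values TRUE and FALSE.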

Standard Value Lists

More of these may be added over time if we identify lists that can have standard values.

Sex

Values:

* M for male
* F for female
* U for unknown
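
If the source data uses other labels, they can be recoded before export. The sketch below is one possible approach, not a platform function: the name `to_standard_sex` and the accepted input labels are assumptions. NA is left untouched, since in the data sets above NA marks rows that are not stratified.

```r
# Hypothetical helper: map free-text sex labels to the standard codes
to_standard_sex <- function(x) {
  lookup <- c(male = "M", m = "M", female = "F", f = "F")
  codes <- unname(lookup[tolower(trimws(x))])
  # Unrecognized, non-missing labels become "U"; NA stays NA
  codes[is.na(codes) & !is.na(x)] <- "U"
  codes
}

to_standard_sex(c("Male", "female", "F", "other", NA))
```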

Neil Harding

2023-09-13