Motivation

A common task in data science pipelines is to summarize a dataset by subgroups. For example, consider everyone’s (least) favorite dataset mtcars:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

This dataset contains two binary variables by which one may be interested in grouping the data: vs and am. Two binary variables give three groupings for summarizing the data (\(2 + 2 + 4 = 8\) groups in total): marginally by the two levels of each variable, or jointly by the four groups defined by crossing the two variables.
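The count of eight groups can be checked with a few lines of base R. This is only a sketch: the names lvls, groupings, and n_groups are made up for illustration and are not part of gofl.

```r
# Levels of each grouping variable, taken from mtcars.
lvls <- list(vs = sort(unique(mtcars$vs)), am = sort(unique(mtcars$am)))

# The three groupings: two marginal, one joint.
groupings <- list(vs = "vs", am = "am", `vs:am` = c("vs", "am"))

# Groups per grouping = product of the level counts of its variables.
n_groups <- vapply(groupings, function(v) prod(lengths(lvls[v])), numeric(1))

n_groups       # groups contributed by each grouping
sum(n_groups)  # total number of groups
```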

One way to obtain these groups is to use dplyr::group_by. By way of example, say we are interested in the mean miles per gallon within each of these groups:

library(dplyr) # provides %>%, group_by(), and summarize()

summarize_mpg <- function(dt) summarize(dt, mean(mpg))

mtcars %>%
  group_by(vs) %>%
  summarize_mpg
## # A tibble: 2 × 2
##      vs `mean(mpg)`
##   <dbl>       <dbl>
## 1     0        16.6
## 2     1        24.6
mtcars %>%
  group_by(am) %>%
  summarize_mpg
## # A tibble: 2 × 2
##      am `mean(mpg)`
##   <dbl>       <dbl>
## 1     0        17.1
## 2     1        24.4
mtcars %>%
  group_by(vs, am) %>%
  summarize_mpg
## `summarise()` has grouped output by 'vs'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups:   vs [2]
##      vs    am `mean(mpg)`
##   <dbl> <dbl>       <dbl>
## 1     0     0        15.0
## 2     0     1        19.8
## 3     1     0        20.7
## 4     1     1        28.4

That works, but every grouping needs its own pipeline, so the approach becomes untenable as the number of grouping variables increases. gofl takes a different approach. The following gofl formula specifies groupings for each of the margins and the joint:

mt_groups <- ~ vs + am + vs:am
# or equivalently
mt_groups <- ~ vs*am
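The same shorthand exists in base R model formulas, where `*` expands to main effects plus the interaction. The sketch below uses stats::terms() to show that expansion. Whether gofl relies on terms() internally is not stated here, so treat this as an analogy rather than a description of gofl's parser.

```r
# Expand both formulas with base R's formula machinery.
long_form  <- attr(terms(~ vs + am + vs:am), "term.labels")
short_form <- attr(terms(~ vs * am), "term.labels")

long_form
short_form
identical(long_form, short_form)
```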

A gofl specification has two parts: a formula defining the groupings and a list defining the levels of the variables in the formula. At this time, the list of levels needs to be created by hand. Future versions of gofl may provide a means to automate this step.

dat <- list(
 vs = sort(unique(mtcars$vs)), # The sorting isn't strictly necessary.
 am = sort(unique(mtcars$am))
)
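Until such a feature exists, a small helper can build the list from a data frame. The function below is a hypothetical convenience written for this example (make_levels is not a gofl function); it assumes every listed variable is a column whose observed values are exactly the levels of interest.

```r
# Hypothetical helper: collect the sorted unique values of each variable.
make_levels <- function(data, vars) {
  lapply(data[vars], function(x) sort(unique(x)))
}

dat_auto <- make_levels(mtcars, c("vs", "am"))
str(dat_auto)
```

Levels that never occur in the data would be missing from the result, so a hand-written list is still the safer choice when unobserved levels matter.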

With these two pieces of data the grouping specification can be created:

grps <- create_groupings(formula = mt_groups, data = dat)

The result of create_groupings is a list with three elements: data, groupings, and index_fcn. The data element is just a copy of the data argument. The groupings element is a list with one specification per group defined by the formula. In our example, the first group defines the subset containing vs == 0:

str(grps$groupings[[1]])
## List of 4
##  $ i   : chr "1-0"
##  $ q   :List of 1
##   ..$ : language ~vs == 0
##   .. ..- attr(*, ".Environment")=<environment: 0x562c814f5508> 
##  $ g   :List of 1
##   ..$ vs: chr "0"
##  $ tags: NULL

A single group’s specification has 4 parts:

  • i: the index of the group (described below)
  • q: a quosure that can be used to carry out the subgrouping in (e.g.) dplyr::filter
  • g: a list that specifies which variables and levels define the group
  • tags: a character vector of tags (see the Tagging section)

Here’s another example of a grouping; in this case it is the subset where vs == 1 and am == 0:

str(grps$groupings[[7]])
## List of 4
##  $ i   : chr "2-1"
##  $ q   :List of 2
##   ..$ : language ~vs == 1
##   .. ..- attr(*, ".Environment")=<environment: 0x562c814f6c80> 
##   ..$ : language ~am == 0
##   .. ..- attr(*, ".Environment")=<environment: 0x562c814f9150> 
##  $ g   :List of 2
##   ..$ vs: chr "1"
##   ..$ am: chr "0"
##  $ tags: NULL
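To make the role of the g part concrete, here is a minimal base R sketch that does not use gofl at all. It builds three hand-written stand-ins for group specifications and turns each one into a row filter plus a labelled summary row. The names groups and summarize_group are hypothetical, and the levels are stored as numbers here, whereas gofl stores them as character strings.

```r
# Hand-built stand-ins for group specifications; only the `g` part is kept.
groups <- list(
  list(g = list(vs = 0)),
  list(g = list(am = 1)),
  list(g = list(vs = 1, am = 0))
)

summarize_group <- function(grp, data = mtcars) {
  # Start with every row, then keep only rows matching each variable's level.
  keep <- rep(TRUE, nrow(data))
  for (v in names(grp$g)) {
    keep <- keep & data[[v]] == grp$g[[v]]
  }
  # One summary row; variables not involved in the group stay NA.
  out <- data.frame(mean_mpg = mean(data$mpg[keep]),
                    vs = NA_real_, am = NA_real_)
  out[names(grp$g)] <- grp$g
  out
}

res <- do.call(rbind, lapply(groups, summarize_group))
res
```

The three means should agree with the corresponding rows of the group_by output shown earlier.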

Using the grouping specification to obtain summaries

With the groupings defined, we can now create a pipeline for summarizing within each group.

purrr::map_dfr(
  .x = grps$groupings,
  .f = ~ {
    mtcars %>% 
    filter(!!! .x$q) %>% 
    summarize_mpg() %>%
    bind_cols(tibble(!!! .x$g))
  }
)
##   mean(mpg)   vs   am
## 1  16.61667    0 <NA>
## 2  24.55714    1 <NA>
## 3  17.14737 <NA>    0
## 4  24.39231 <NA>    1
## 5  15.05000    0    0
## 6  19.75000    0    1
## 7  20.74286    1    0
## 8  28.37143    1    1

That’s the basic idea of gofl. From the patterns described above you can specify much more complicated groupings.

Indices

Each element of a grouping is given a unique index as follows:

  • each variable is given a position in the pattern v1-v2-v3-...-vn, where the order is the order of the variables given in the create_groupings data argument. In the example above, vs corresponds to the first position (v1) and am to the second (v2).
  • each level of a variable is given a positive integer value. In the example, vs = 0 corresponds to 1 and vs = 1 corresponds to 2. A group not involving vs has 0 in that position.

The index 1-0 in our example is the group defined by vs == 0; 0-2 is the group defined by am == 1; 1-2 is defined by vs == 0 & am == 1.
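The encoding can be reproduced in a few lines of base R. The function below is a sketch of the scheme as described, not gofl's implementation; make_index and its arguments are hypothetical names.

```r
# Levels in the same order as the create_groupings data argument.
dat <- list(vs = c(0, 1), am = c(0, 1))

# Hypothetical re-implementation of the indexing scheme described above.
make_index <- function(levels_list, ...) {
  picked <- list(...)
  pos <- vapply(names(levels_list), function(v) {
    # 0 when the variable is not part of the group, else the level's position.
    if (is.null(picked[[v]])) 0L else match(picked[[v]], levels_list[[v]])
  }, integer(1))
  paste(pos, collapse = "-")
}

make_index(dat, vs = 0)          # group defined by vs == 0
make_index(dat, am = 1)          # group defined by am == 1
make_index(dat, vs = 0, am = 1)  # group defined by vs == 0 & am == 1
```

Because positions follow the order of the levels list rather than the order of the arguments, passing am before vs gives the same index.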

The index_fcn part of a create_groupings object allows the user to look up indices.

grps$index_fcn(am = 0)
## [1] "0-1"
grps$index_fcn(am = 0, vs = 1)
## [1] "2-1"
grps$index_fcn(vs = 1)
## [1] "2-0"

Indices are useful when the number of groups is large and you need a way to quickly find a particular group.

Tagging

In real applications, you may want downstream processes to apply different functions to different groups. The tag function allows you to tag particular groups with arbitrary character vectors. Let’s make our example above even more unrealistic, but nonetheless illustrative, and take the mean in the marginal groups and the median in the joint groups.

First, we tag the grouping as appropriate:

mt_groups <- ~ tag(vs + am, "marginal") + tag(vs:am, "joint")
grps <- create_groupings(formula = mt_groups, data = dat)

Note that the tags element is now populated:

grps$groupings[c(1,5)]
## [[1]]
## [[1]]$i
## [1] "1-0"
## 
## [[1]]$q
## [[1]]$q[[1]]
## <quosure>
## expr: ^vs == 0
## env:  0x562c80b46290
## 
## 
## [[1]]$g
## [[1]]$g$vs
## [1] "0"
## 
## 
## [[1]]$tags
## [1] "marginal"
## 
## 
## [[2]]
## [[2]]$i
## [1] "1-1"
## 
## [[2]]$q
## [[2]]$q[[1]]
## <quosure>
## expr: ^vs == 0
## env:  0x562c80b46290
## 
## [[2]]$q[[2]]
## <quosure>
## expr: ^am == 0
## env:  0x562c80b4ac30
## 
## 
## [[2]]$g
## [[2]]$g$vs
## [1] "0"
## 
## [[2]]$g$am
## [1] "0"
## 
## 
## [[2]]$tags
## [1] "joint"

An example pipeline taking advantage of the tag may look like:

summarize_mpg2 <- function(dt, tag){
  avg <- switch(tag, "marginal" = mean, "joint" = median)
  summarize(dt, avg(mpg))
}

purrr::map_dfr(
  .x = grps$groupings,
  .f = ~ {
    mtcars %>% 
    filter(!!! .x$q) %>% 
    summarize_mpg2(.x$tags) %>%
    bind_cols(tibble(!!! .x$g))
  }
)
##   avg(mpg)   vs   am
## 1 16.61667    0 <NA>
## 2 24.55714    1 <NA>
## 3 17.14737 <NA>    0
## 4 24.39231 <NA>    1
## 5 15.20000    0    0
## 6 20.35000    0    1
## 7 21.40000    1    0
## 8 30.40000    1    1
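One caveat with dispatching through switch() is that it needs a single string, so a group without tags (tags is NULL) or with an unexpected tag needs explicit handling. The sketch below shows one possible way to make the lookup defensive. pick_summary is a hypothetical helper, and falling back to the mean for untagged groups is an assumption made for this example, not gofl behavior.

```r
# Hypothetical helper: map a group's tags to a summary function.
pick_summary <- function(tags) {
  # Assumption: untagged groups fall back to the mean.
  if (is.null(tags)) return(mean)
  switch(
    tags[[1]],  # only the first tag is used in this sketch
    marginal = mean,
    joint = median,
    stop("unknown tag: ", tags[[1]])
  )
}

# mpg values for the joint group vs == 0 & am == 0.
joint_00 <- mtcars$mpg[mtcars$vs == 0 & mtcars$am == 0]

pick_summary("joint")(joint_00)     # median, as used for joint groups
pick_summary("marginal")(joint_00)  # mean, as used for marginal groups
```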