A common task in data science pipelines is to summarize a dataset by
subgroups. For example, consider everyone’s (least) favorite dataset, mtcars:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
This dataset contains two binary variables by which one may be
interested in grouping the data: vs and am.
Two binary variables present 3 groupings (for \(2 + 2 + 4 = 8\) possible groups total) for
summarizing the data: marginally by the two groups within each variable,
or jointly by the 4 groups defined by the cross of the two
variables.
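The count can be checked directly in base R. The sketch below tallies the distinct groups in each of the three groupings; the names n_vs, n_am, and n_joint are introduced here purely for illustration.

```r
# Count the groups in each grouping of mtcars (base R only).
n_vs    <- length(unique(mtcars$vs))              # marginal groups for vs
n_am    <- length(unique(mtcars$am))              # marginal groups for am
n_joint <- nrow(unique(mtcars[, c("vs", "am")]))  # joint groups for vs crossed with am

# Named vector with the per-grouping counts and their total.
c(vs = n_vs, am = n_am, joint = n_joint, total = n_vs + n_am + n_joint)
```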
One way to obtain these groups is to use dplyr::group_by. By way of
example, say we are interested
in the mean miles per gallon within each of these groups:
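The calls behind the output that follows are not shown in this text. A minimal sketch that should reproduce it, assuming dplyr is attached and that summarize_mpg (which is used again later) is a thin wrapper around summarize; the definition here is a guess based on the `mean(mpg)` column name in the output:

```r
library(dplyr)

# Assumed definition: the output column name `mean(mpg)` suggests this form.
summarize_mpg <- function(dt) {
  summarize(dt, mean(mpg))
}

by_vs    <- mtcars %>% group_by(vs) %>% summarize_mpg()      # marginal: vs
by_am    <- mtcars %>% group_by(am) %>% summarize_mpg()      # marginal: am
by_vs_am <- mtcars %>% group_by(vs, am) %>% summarize_mpg()  # joint: vs crossed with am

by_vs
by_am
by_vs_am
```

Each grouping needs its own call, which is the repetition the rest of this document sets out to avoid.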
## # A tibble: 2 × 2
## vs `mean(mpg)`
## <dbl> <dbl>
## 1 0 16.6
## 2 1 24.6
## # A tibble: 2 × 2
## am `mean(mpg)`
## <dbl> <dbl>
## 1 0 17.1
## 2 1 24.4
## `summarise()` has grouped output by 'vs'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups: vs [2]
## vs am `mean(mpg)`
## <dbl> <dbl> <dbl>
## 1 0 0 15.0
## 2 0 1 19.8
## 3 1 0 20.7
## 4 1 1 28.4
That works, but every margin and every interaction needs its own
group_by call, so this approach becomes untenable as the number of
grouping variables increases. gofl takes a different
approach. The following gofl formula specifies groupings
for each of the margins and the joint:
mt_groups <- ~ vs + am + vs:am
# or equivalently
mt_groups <- ~ vs*am
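The equivalence mirrors R's usual formula algebra, in which a*b expands to a + b + a:b. As a quick check, base R's terms() can expand both forms. This shows how model formulas are parsed in general; that gofl reads its formulas the same way is an assumption here, not a statement about its internals.

```r
# Expand both formulas with base R and compare the resulting term labels.
long_form  <- attr(terms(~ vs + am + vs:am), "term.labels")
short_form <- attr(terms(~ vs * am), "term.labels")

long_form
identical(long_form, short_form)
```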
A gofl specification has two parts: a formula defining the groupings
and a list defining the levels of the variables in the formula. At this
time, the list of levels needs to be created by hand. Future versions
of gofl may provide a means to automate this step.
dat <- list(
  vs = sort(unique(mtcars$vs)), # The sorting isn't strictly necessary.
  am = sort(unique(mtcars$am))
)
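Until such a feature exists, a small helper can cut down on the typing when there are many grouping variables. The function below is hypothetical and not part of gofl; it only assumes that the levels list should hold the sorted unique values of each named column.

```r
# Hypothetical helper (not part of gofl): collect the sorted unique values
# of the named columns of a data frame into a named list.
make_levels <- function(df, vars) {
  lapply(df[vars], function(x) sort(unique(x)))
}

dat <- make_levels(mtcars, c("vs", "am"))
str(dat)
```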
With these two pieces of data the grouping specification can be created:
grps <- create_groupings(formula = mt_groups, data = dat)
The result of create_groupings is a list with three
elements: data, groupings, and index_fcn. The data element is
just a copy of the data argument. The groupings element is a list
in which each element is the specification of one
group defined by the formula. In our example, the first group
defines the subset where vs == 0:
str(grps$groupings[[1]])
## List of 4
## $ i : chr "1-0"
## $ q :List of 1
## ..$ : language ~vs == 0
## .. ..- attr(*, ".Environment")=<environment: 0x562c814f5508>
## $ g :List of 1
## ..$ vs: chr "0"
## $ tags: NULL
A single group’s specification has 4 parts:
- i: the index of the group (described below)
- q: a list of quosures that can be used to carry out the subgrouping in (e.g.) dplyr::filter
- g: a list that specifies which variables and levels define the group
- tags: a character vector of tags (see the tags section)
Here’s another example of a grouping; in this case it is the subset
where vs == 1 and am == 0:
str(grps$groupings[[7]])
## List of 4
## $ i : chr "2-1"
## $ q :List of 2
## ..$ : language ~vs == 1
## .. ..- attr(*, ".Environment")=<environment: 0x562c814f6c80>
## ..$ : language ~am == 0
## .. ..- attr(*, ".Environment")=<environment: 0x562c814f9150>
## $ g :List of 2
## ..$ vs: chr "1"
## ..$ am: chr "0"
## $ tags: NULL
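To see how the q element is meant to be used, the sketch below builds an equivalent pair of quosures by hand with rlang::quos and splices them into dplyr::filter. The hand-built q is a stand-in for grps$groupings[[7]]$q, chosen so the example runs without gofl installed.

```r
library(dplyr)

# Hand-built stand-in for grps$groupings[[7]]$q (an assumption for illustration).
q <- rlang::quos(vs == 1, am == 0)

# Splicing with !!! passes each quosure to filter() as a separate condition;
# filter() keeps only the rows that satisfy all of them.
subset_7 <- filter(mtcars, !!!q)
nrow(subset_7)
```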
With the groupings defined, we can create a pipeline for summarizing within each group.
purrr::map_dfr(
  .x = grps$groupings,
  .f = ~ {
    mtcars %>%
      filter(!!! .x$q) %>%          # keep only the rows in this group
      summarize_mpg() %>%           # mean mpg within the group
      bind_cols(tibble(!!! .x$g))   # label the row with the group's levels
  }
)
## mean(mpg) vs am
## 1 16.61667 0 <NA>
## 2 24.55714 1 <NA>
## 3 17.14737 <NA> 0
## 4 24.39231 <NA> 1
## 5 15.05000 0 0
## 6 19.75000 0 1
## 7 20.74286 1 0
## 8 28.37143 1 1
That’s the basic idea of gofl. From the patterns
described above you can specify much more complicated groupings.
Each element of a grouping is given a unique index of the form
v1-v2-v3-...-vn, where the order is the order of the
variables given in the create_groupings data argument. In the example
above, vs corresponds to the first position (v1) and am to the second
(v2). For vs, vs = 0 corresponds to 1 and vs = 1 corresponds to 2;
a group not involving vs corresponds to 0. The index 1-0
in our example is the group defined by vs == 0; 0-2
is the group defined by am == 1; 1-2 is defined by
vs == 0 & am == 1.
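The indexing rule can be sketched in a few lines of base R. This is a hypothetical re-implementation for illustration, not gofl's own code. It assumes each position holds the rank of the chosen level within that variable's entry in the levels list, which matches the examples here but is not confirmed by the package.

```r
# Hypothetical re-implementation of the indexing rule; not gofl's own code.
# Assumption: a position holds the rank of the chosen level within the
# variable's levels, or 0 when the variable is not part of the group.
make_index <- function(levels, ...) {
  chosen <- list(...)
  pos <- vapply(names(levels), function(v) {
    if (is.null(chosen[[v]])) 0L else match(chosen[[v]], levels[[v]])
  }, integer(1))
  paste(pos, collapse = "-")
}

levels_list <- list(vs = c(0, 1), am = c(0, 1))
make_index(levels_list, vs = 0)          # "1-0"
make_index(levels_list, am = 1)          # "0-2"
make_index(levels_list, vs = 0, am = 1)  # "1-2"
```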
The index_fcn part of a create_groupings object allows the user to look up indices.
grps$index_fcn(am = 0)
## [1] "0-1"
grps$index_fcn(am = 0, vs = 1)
## [1] "2-1"
grps$index_fcn(vs = 1)
## [1] "2-0"
Indices are useful when the number of groups is large and you need a way to quickly find a particular group.
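One way to use an index for lookup is to name the list of groupings by each group's i field. The sketch below uses a toy list that mimics only the i and g fields of grps$groupings, so it runs without gofl; the toy list is an assumption for illustration.

```r
# Toy stand-in for grps$groupings, keeping only the i and g fields.
groupings <- list(
  list(i = "1-0", g = list(vs = "0")),
  list(i = "0-1", g = list(am = "0")),
  list(i = "2-1", g = list(vs = "1", am = "0"))
)

# Name the list by index so a group can be retrieved by its index string.
names(groupings) <- vapply(groupings, function(grp) grp$i, character(1))
groupings[["2-1"]]$g
```

With a real create_groupings object, the lookup key could come from grps$index_fcn(am = 0, vs = 1) rather than being typed by hand.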
In real applications, you may want downstream processes to apply
different functions to different groups. The tag function
allows you to tag particular groups with arbitrary character
vectors. Let’s make our example above even more unrealistic but
nonetheless illustrative and take the mean in the marginal groups and
the median in the joint groups.
First, we tag the groupings as appropriate:
mt_groups <- ~ tag(vs + am, "marginal") + tag(vs:am, "joint")
grps <- create_groupings(formula = mt_groups, data = dat)
Note that the tags element is now populated:
grps$groupings[c(1,5)]
## [[1]]
## [[1]]$i
## [1] "1-0"
##
## [[1]]$q
## [[1]]$q[[1]]
## <quosure>
## expr: ^vs == 0
## env: 0x562c80b46290
##
##
## [[1]]$g
## [[1]]$g$vs
## [1] "0"
##
##
## [[1]]$tags
## [1] "marginal"
##
##
## [[2]]
## [[2]]$i
## [1] "1-1"
##
## [[2]]$q
## [[2]]$q[[1]]
## <quosure>
## expr: ^vs == 0
## env: 0x562c80b46290
##
## [[2]]$q[[2]]
## <quosure>
## expr: ^am == 0
## env: 0x562c80b4ac30
##
##
## [[2]]$g
## [[2]]$g$vs
## [1] "0"
##
## [[2]]$g$am
## [1] "0"
##
##
## [[2]]$tags
## [1] "joint"
An example pipeline taking advantage of the tags may look like:
summarize_mpg2 <- function(dt, tag){
  # Choose the summary function from the group's tag.
  avg <- switch(tag, "marginal" = mean, "joint" = median)
  summarize(dt, avg(mpg))
}
purrr::map_dfr(
  .x = grps$groupings,
  .f = ~ {
    mtcars %>%
      filter(!!! .x$q) %>%
      summarize_mpg2(.x$tags) %>%   # use the full element name, tags
      bind_cols(tibble(!!! .x$g))
  }
)
## avg(mpg) vs am
## 1 16.61667 0 <NA>
## 2 24.55714 1 <NA>
## 3 17.14737 <NA> 0
## 4 24.39231 <NA> 1
## 5 15.20000 0 0
## 6 20.35000 0 1
## 7 21.40000 1 0
## 8 30.40000 1 1
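One caveat with dispatching on tags through switch: when no alternative matches and there is no default, switch returns NULL invisibly, so an unexpected tag only surfaces later as a less informative error. A hedged variant with an explicit default is sketched below; pick_summary is a hypothetical name, and the length check assumes each group carries exactly one tag.

```r
# Hypothetical variant of the tag dispatch with an explicit failure mode.
pick_summary <- function(tag) {
  # switch() needs a single string; this assumes one tag per group.
  stopifnot(is.character(tag), length(tag) == 1)
  switch(tag,
    marginal = mean,
    joint    = median,
    stop("no summary function registered for tag: ", tag)
  )
}

# Returns the function itself, which can then be applied to a column.
pick_summary("joint")(mtcars$mpg)
```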