collapse_for_tidyverse_users.R

*collapse* is a C/C++ based package for data transformation
and statistical computing in R that aims to enable greater performance
and statistical complexity in data manipulation tasks and offers a
stable, class-agnostic, and lightweight API. It is part of the core *fastverse*, a
suite of lightweight packages with similar objectives.

The *tidyverse* set
of packages provides a rich, expressive, and consistent syntax for data
manipulation in R centering on the *tibble* object and tidy data
principles (each observation is a row, each variable is a column).

*collapse* fully supports the *tibble* object and
provides many *tidyverse*-like functions for data manipulation.
It can thus be used to write *tidyverse*-like data manipulation
code that, thanks to low-level vectorization of many statistical
operations and optimized R code, typically runs much faster than native
*tidyverse* code (in addition to being much more lightweight in
dependencies).

Its aim is not to create a faster *tidyverse*, i.e., it does
not implement all aspects of the rich *tidyverse* grammar or
changes to it^{1}. It also takes inspiration from other
leading data manipulation libraries to serve broad aims of performance,
parsimony, complexity, and robustness in data manipulation for R.

*collapse* data manipulation functions familiar to
*tidyverse* users include `fselect`, `fgroup_by`, `fsummarise`,
`fmutate`, `across`, `frename`, and `fcount`. Other functions like
`fsubset`, `ftransform`, and `get_vars` are inspired by base R, while
still others like `join`, `pivot`, `roworder`, `colorder`, `rowbind`,
etc. are inspired by other data manipulation libraries such as
*data.table* and *polars*.

By virtue of the f- prefixes, the *collapse* namespace has no
conflicts with the *tidyverse*, and these functions can easily be
substituted in a *tidyverse* workflow.
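
As a minimal sketch of such a substitution, the pipeline below uses only prefixed *collapse* verbs on the built-in `mtcars` data, so it can run with *dplyr* attached and nothing masked. The filter threshold and the output column names `mean_mpg` and `mean_hp` are chosen here purely for illustration.

```r
library(collapse)

# Prefixed verbs only: no conflict with dplyr's filter/group_by/summarise
res <- mtcars |>
  fsubset(mpg > 20) |>
  fgroup_by(cyl) |>
  fsummarise(mean_mpg = fmean(mpg),
             mean_hp  = fmean(hp))
res
```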

R users willing to replace the *tidyverse* have the additional
option to mask functions and eliminate the prefixes with
`set_collapse`. For example, calling `set_collapse(mask = "manip")`
after loading the package makes available functions `select`, `group_by`,
`summarise`, `mutate`, `rename`, `count`, `subset`, and `transform` in
the *collapse* namespace and detaches and re-attaches the
package, such that the following code is executed by
*collapse*:

```
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), mean),
            qsec_wt = weighted.mean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
```

*Note* that the correct documentation still needs to be called
with prefixes, i.e., `?fsubset`. See `?set_collapse` for further
options to the package, which also includes optimization options such as
`nthreads`, `na.rm`, `sort`, and `stable.algo`.
*Note* also that if you use *collapse*’s namespace
masking, you can use `fastverse::fastverse_conflicts()` to
check for namespace conflicts with other packages.

A key feature of *collapse* is that it not only provides
functions for data manipulation, but also a full set of statistical
functions and algorithms to speed up statistical calculations and
perform more complex statistical operations (e.g. involving weights or
time series data).

Notable among these are the *Fast Statistical Functions*, a
consistent set of S3-generic statistical functions providing fully
vectorized statistical operations in R.

Specifically, operations such as calculating the mean via the S3
generic `fmean()` function are vectorized across columns and
groups and may also involve weights or transformations of the original
data:

```
fmean(mtcars$mpg) # Vector
# [1] 20.09062
fmean(EuStockMarkets) # Matrix
# DAX SMI CAC FTSE
# 2530.657 3376.224 2227.828 3565.643
fmean(mtcars) # Data Frame
# mpg cyl disp hp drat wt qsec vs am
# 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250
# gear carb
# 3.687500 2.812500
fmean(mtcars$mpg, w = mtcars$wt) # Weighted mean
# [1] 18.54993
fmean(mtcars$mpg, g = mtcars$cyl) # Grouped mean
# 4 6 8
# 26.66364 19.74286 15.10000
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt) # Weighted group mean
# 4 6 8
# 25.93504 19.64578 14.80643
fmean(mtcars[5:10], g = mtcars$cyl, w = mtcars$wt) # Of data frame
# drat wt qsec vs am gear
# 4 4.031264 2.414750 19.38044 0.9148868 0.6498031 4.047250
# 6 3.569170 3.152060 18.12198 0.6212191 0.3787809 3.821036
# 8 3.205658 4.133116 16.88529 0.0000000 0.1203808 3.240762
fmean(mtcars$mpg, g = mtcars$cyl, w = mtcars$wt, TRA = "fill") # Replace data by weighted group mean
# [1] 19.64578 19.64578 25.93504 19.64578 14.80643 19.64578 14.80643 25.93504 25.93504 19.64578
# [11] 19.64578 14.80643 14.80643 14.80643 14.80643 14.80643 14.80643 25.93504 25.93504 25.93504
# [21] 25.93504 14.80643 14.80643 14.80643 14.80643 25.93504 25.93504 25.93504 14.80643 19.64578
# [31] 14.80643 25.93504
# etc...
```

The data manipulation functions of *collapse* are integrated
with these *Fast Statistical Functions* to enable vectorized
statistical operations. For example, the following code

```
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), fmean),
            qsec_wt = fmean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
```

gives exactly the same result as above, but the execution is much
faster (especially on larger data), because with *Fast Statistical
Functions*, the data does not need to be split by groups, and there
is no need to call `lapply()` inside the `across()` statement:
`fmean.data.frame()` is simply applied to a subset of the data
containing columns `mpg`, `carb` and `hp`.

The *Fast Statistical Functions* also have a method for
grouped data, so if we did not want to calculate the weighted mean of
`qsec`, the code would simplify as follows:

```
mtcars |>
  subset(mpg > 11) |>
  group_by(cyl, vs, am) |>
  select(mpg, carb, hp) |>
  fmean()
# cyl vs am mpg carb hp
# 1 4 0 1 26.00000 2.000000 91.00000
# 2 4 1 0 22.90000 1.666667 84.66667
# 3 4 1 1 28.37143 1.428571 80.57143
# 4 6 0 1 20.56667 4.666667 131.66667
# 5 6 1 0 19.12500 2.500000 115.25000
# 6 8 0 0 15.98000 2.900000 191.00000
# 7 8 0 1 15.40000 6.000000 299.50000
```

Note that all functions in *collapse*, including the *Fast
Statistical Functions*, have the default `na.rm = TRUE`,
i.e., missing values are skipped in calculations. This can be changed
using `set_collapse(na.rm = FALSE)` to give behavior more
consistent with base R.
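
To make the difference in defaults concrete, here is a small sketch on a toy vector `x` (an illustrative input, not part of the examples above) that compares base R's `mean()` with `fmean()` and shows the per-call override.

```r
library(collapse)

x <- c(1, NA, 3)            # toy vector with one missing value

mean(x)                     # base R default (na.rm = FALSE) returns NA
fmean(x)                    # collapse default (na.rm = TRUE) skips the NA and returns 2
fmean(x, na.rm = FALSE)     # per-call override returns NA, as in base R
```

The per-call argument leaves the global setting untouched, which may be preferable when only a few calls should propagate missing values.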

Another thing to be aware of when using *Fast Statistical
Functions* inside data manipulation functions is that they toggle
vectorized execution wherever they are used. E.g.

```
mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg) + min(qsec)) # Vectorized
# cyl mpg
# 1 4 41.16364
# 2 6 34.24286
# 3 8 29.60000
```

calculates a grouped mean of `mpg` but adds the overall
minimum of `qsec` to the result, whereas

```
mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg) + fmin(qsec)) # Vectorized
# cyl mpg
# 1 4 43.36364
# 2 6 35.24286
# 3 8 29.60000
mtcars |> group_by(cyl) |> summarise(mpg = mean(mpg) + min(qsec)) # Not vectorized
# cyl mpg
# 1 4 43.36364
# 2 6 35.24286
# 3 8 29.60000
```

both give the mean + the minimum within each group, but calculated in
different ways: the former is equivalent to
`fmean(mpg, g = cyl) + fmin(qsec, g = cyl)`, whereas the
latter is equal to
`mapply(function(x, y) mean(x) + min(y), gsplit(mpg, cyl), gsplit(qsec, cyl))`.
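
The two evaluation strategies can be written out side by side as a sketch outside of `summarise()`. Because the expression involves two columns, the split-apply version here splits both `mpg` and `qsec` by `cyl`; the variable names `vectorized` and `split_apply` are illustrative.

```r
library(collapse)

# Vectorized: two grouped computations over whole columns, then one addition
vectorized <- with(mtcars, fmean(mpg, g = cyl) + fmin(qsec, g = cyl))

# Split-apply-combine: split both columns by cyl, evaluate once per group
split_apply <- with(mtcars,
  mapply(function(m, q) mean(m) + min(q),
         gsplit(mpg, cyl), gsplit(qsec, cyl)))

all.equal(unname(vectorized), unname(split_apply))
```

Both return one value per level of `cyl`; the vectorized form avoids materializing the per-group pieces, which is where the speed difference on larger data comes from.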

See `?fsummarise` and `?fmutate` for more
detailed examples. This *eager vectorization* approach is
intentional as it allows users to vectorize complex expressions and fall
back to base R if this is not desired. A
blog post by Andrew Ghazi provides an excellent example of computing
a p-value test statistic by groups.

To take full advantage of *collapse*, it is highly recommended
to use the *Fast Statistical Functions* as much as possible. You
can also set `set_collapse(mask = "all")` to replace
statistical functions in base R like `sum` and `mean` with the
collapse versions (toggling vectorized execution in all cases), but
this may affect other parts of your code^{2}.

It is also performance-critical to correctly sequence operations and
limit excess computations. *tidyverse* code is often inefficient
simply because the *tidyverse* allows you to do everything. For
example,
`mtcars |> group_by(cyl) |> filter(mpg > 13) |> arrange(mpg)`
is permissible but inefficient code as it filters and reorders grouped
data, requiring modifications to both the data frame and the attached
grouping object. *collapse* does not allow calls to
`fsubset()` on grouped data, and messages about it in
`roworder()`, encouraging you to write more efficient
code.

The above example can also be optimized because we are subsetting the whole frame and then doing computations on a subset of columns. It would be more efficient to select all required columns during the subset operation:

```
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp, qsec, wt) |>
  group_by(cyl, vs, am) |>
  summarise(across(c(mpg, carb, hp), fmean),
            qsec_wt = fmean(qsec, wt))
# cyl vs am mpg carb hp qsec_wt
# 1 4 0 1 26.00000 2.000000 91.00000 16.70000
# 2 4 1 0 22.90000 1.666667 84.66667 21.04028
# 3 4 1 1 28.37143 1.428571 80.57143 18.75509
# 4 6 0 1 20.56667 4.666667 131.66667 16.33306
# 5 6 1 0 19.12500 2.500000 115.25000 19.21275
# 6 8 0 0 15.98000 2.900000 191.00000 17.01239
# 7 8 0 1 15.40000 6.000000 299.50000 14.55297
```

Without the weighted mean of `qsec`, this would simplify
to

```
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp) |>
  group_by(cyl, vs, am) |>
  fmean()
# cyl vs am mpg carb hp
# 1 4 0 1 26.00000 2.000000 91.00000
# 2 4 1 0 22.90000 1.666667 84.66667
# 3 4 1 1 28.37143 1.428571 80.57143
# 4 6 0 1 20.56667 4.666667 131.66667
# 5 6 1 0 19.12500 2.500000 115.25000
# 6 8 0 0 15.98000 2.900000 191.00000
# 7 8 0 1 15.40000 6.000000 299.50000
```

Finally, we could set the following options to toggle unsorted grouping, no missing value skipping, and multithreading across the three columns for more efficient execution.

```
mtcars |>
  subset(mpg > 11, cyl, vs, am, mpg, carb, hp) |>
  group_by(cyl, vs, am, sort = FALSE) |>
  fmean(nthreads = 3, na.rm = FALSE)
# cyl vs am mpg carb hp
# 1 6 0 1 20.56667 4.666667 131.66667
# 2 4 1 1 28.37143 1.428571 80.57143
# 3 6 1 0 19.12500 2.500000 115.25000
# 4 8 0 0 15.98000 2.900000 191.00000
# 5 4 1 0 22.90000 1.666667 84.66667
# 6 4 0 1 26.00000 2.000000 91.00000
# 7 8 0 1 15.40000 6.000000 299.50000
```

Setting these options globally using
`set_collapse(sort = FALSE, nthreads = 3, na.rm = FALSE)`
avoids the need to set them repeatedly.

Another key to writing efficient code with *collapse* is to
avoid `fgroup_by()` where possible, especially for mutate
operations. *collapse* does not implement `.by`
arguments to manipulation functions like *dplyr*, but instead
allows ad-hoc grouped transformations through its statistical functions.
For example, the easiest and fastest way to compute the median of
`mpg` by `cyl`, `vs`, and `am` is

```
mtcars |>
  mutate(mpg_median = fmedian(mpg, list(cyl, vs, am), TRA = "fill")) |>
  head(3)
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_median
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 30.4
```

For the common case of averaging and centering data,
*collapse* also provides functions `fbetween()` for
averaging and `fwithin()` for centering, i.e.,
`fbetween(mpg, list(cyl, vs, am))` is the same as
`fmean(mpg, list(cyl, vs, am), TRA = "fill")`. There is also
`fscale()` for (grouped) scaling and centering.
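
The stated equivalences can be checked directly. The following sketch builds the grouping list once (the name `g` is illustrative) and compares the dedicated functions with their `fmean(..., TRA = )` counterparts on `mtcars$mpg`.

```r
library(collapse)

g <- with(mtcars, list(cyl, vs, am))   # ad-hoc grouping by three columns

# Averaging: group mean expanded to the length of the data
avg_equal <- all.equal(fbetween(mtcars$mpg, g),
                       fmean(mtcars$mpg, g, TRA = "fill"))

# Centering: data minus its group mean
ctr_equal <- all.equal(fwithin(mtcars$mpg, g),
                       fmean(mtcars$mpg, g, TRA = "-"))

c(averaging = avg_equal, centering = ctr_equal)
```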

This also applies to multiple columns, where we can use
`fmutate(across(...))` or `ftransformv()`,
i.e.

```
mtcars |>
  mutate(across(c(mpg, disp, qsec), fmedian, list(cyl, vs, am), TRA = "fill")) |>
  head(2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 16.46 0 1 4 4
# Or
mtcars |>
  transformv(c(mpg, disp, qsec), fmedian, list(cyl, vs, am), TRA = "fill") |>
  head(2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 16.46 0 1 4 4
```

Of course, if we want to apply different functions using the same
grouping, `fgroup_by()` is sensible, but for mutate
operations it also has the argument `return.groups = FALSE`,
which avoids materializing the unique grouping columns, saving some
memory.

```
mtcars |>
  group_by(cyl, vs, am, return.groups = FALSE) |>
  mutate(mpg_median = fmedian(mpg),
         mpg_mean = fmean(mpg), # Or fbetween(mpg)
         mpg_demean = fwithin(mpg), # Or fmean(mpg, TRA = "-")
         mpg_scale = fscale(mpg),
         .keep = "used") |>
  ungroup() |>
  head(3)
# mpg cyl vs am mpg_median mpg_mean mpg_demean mpg_scale
# Mazda RX4 21.0 6 0 1 21.0 20.56667 0.4333333 0.5773503
# Mazda RX4 Wag 21.0 6 0 1 21.0 20.56667 0.4333333 0.5773503
# Datsun 710 22.8 4 1 1 30.4 28.37143 -5.5714286 -1.1710339
```

The `TRA` argument supports a whole array of operations,
see `?TRA`. For example `fsum(mtcars, TRA = "/")`
turns the column vectors into proportions. As an application of this,
consider a generated dataset of sector-level exports.

```
# c = country, s = sector, y = year, v = value
exports <- expand.grid(c = paste0("c", 1:8), s = paste0("s", 1:8), y = 1:15) |>
  mutate(v = round(abs(rnorm(length(c), mean = 5)), 2)) |>
  subset(-sample.int(length(v), 360)) # Making it unbalanced and irregular
head(exports)
# c s y v
# 1 c2 s1 1 5.55
# 2 c3 s1 1 4.33
# 3 c4 s1 1 5.21
# 4 c5 s1 1 5.31
# 5 c6 s1 1 6.17
# 6 c7 s1 1 5.62
nrow(exports)
# [1] 600
```

It is then very easy to compute Balassa’s (1965) Revealed Comparative Advantage (RCA) index, which is the share of a sector in country exports divided by the share of the sector in world exports. An index above 1 indicates an RCA of country c in sector s.

```
# Computing Balassa's (1965) RCA index: fast and memory efficient
# settfm() modifies exports and assigns it back to the global environment
settfm(exports, RCA = fsum(v, list(c, y), TRA = "/") %/=% fsum(v, list(s, y), TRA = "/"))
```

Note that this involved a single expression with two different
grouped operations, which is only possible by incorporating grouping
into statistical functions themselves. Let’s summarise this dataset
using `pivot()` to aggregate the RCA index across years. Here
`"mean"` calls a highly efficient internal mean function.

```
pivot(exports, ids = "c", values = "RCA", names = "s",
      how = "wider", FUN = "mean", sort = TRUE)
# c s1 s2 s3 s4 s5 s6 s7 s8
# 1 c1 0.9327521 0.9087815 0.9434970 1.105864 1.158613 0.9579166 1.1094150 1.218718
# 2 c2 1.4989832 1.0502050 0.8113781 1.024990 1.103707 1.1494829 1.0681358 1.021685
# 3 c3 1.0403483 0.9580809 0.8358023 1.024633 1.192487 0.9333733 1.0719161 1.010648
# 4 c4 0.9771630 1.0265800 0.9293951 1.007469 1.052942 0.9285248 1.4031524 1.027218
# 5 c5 0.9807908 1.1023470 0.8480027 1.080013 1.072168 0.9704144 1.1817784 1.099050
# 6 c6 0.9819940 1.1434701 0.9122508 1.164649 1.193275 0.9322847 0.9929571 1.177062
# 7 c7 1.1542193 1.1939893 0.7462051 1.109936 1.438044 1.0482547 1.5907867 1.055214
# 8 c8 1.4220817 1.2235288 0.7090515 1.189408 1.119605 1.3108897 1.3264848 1.279526
```

We may also wish to investigate the growth rate of RCA. This can be
done using `fgrowth()`. Since the panel is irregular, i.e.,
not every sector is observed in every year, it is critical to also
supply the time variable.

```
exports |>
  mutate(RCA_growth = fgrowth(RCA, g = list(c, s), t = y)) |>
  pivot(ids = "c", values = "RCA_growth", names = "s",
        how = "wider", FUN = fmedian, sort = TRUE)
# c s1 s2 s3 s4 s5 s6 s7 s8
# 1 c1 NA 29.87093 56.837880 0.3513705 11.9750588 6.356499 5.186966 3.4725766
# 2 c2 -19.092254 -10.72516 50.412427 8.7380006 -25.7119274 -17.958011 -36.853824 -30.5827161
# 3 c3 -3.904880 -29.72276 4.338254 4.2112875 13.8705938 -27.368230 -5.214542 -10.4867005
# 4 c4 0.639523 19.74757 -9.602120 9.7104112 42.0912878 17.583594 -27.915967 -18.1145784
# 5 c5 8.184523 18.93554 -5.333235 1.5243547 -0.3306585 8.682935 -15.678443 18.3991608
# 6 c6 12.606978 67.07558 19.270685 43.8243108 -25.0283737 -21.785028 -10.059702 0.7774246
# 7 c7 24.400344 48.56792 27.552571 -16.9311897 -6.6046775 -28.627885 -12.092345 24.5298895
# 8 c8 158.342022 17.99249 -61.857965 36.3372079 0.2085139 -2.178978 -18.666774 -40.5714063
```
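
Why the time variable matters can be seen on a toy series with a gap. The sketch below uses made-up values `x` observed in the hypothetical periods `year = 1, 2, 4`; it assumes a recent *collapse* version in which irregular time series are supported by `fgrowth()`.

```r
library(collapse)

x    <- c(100, 110, 121)   # toy values
year <- c(1L, 2L, 4L)      # period 3 is not observed

no_time   <- fgrowth(x)            # treats the rows as consecutive periods
with_time <- fgrowth(x, t = year)  # looks for the value one period earlier

no_time
with_time
```

Without `t`, the last value is compared with the previous row, which is two periods earlier; with `t`, the missing lag yields `NA` instead of a misleading one-period growth rate.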

Lastly, since the panel is unbalanced, we may wish to create an RCA index for only the last year, but balance the dataset a bit more by taking the last available trade within the last three years. This can be done using a single subset call

```
# Taking the latest observation within the last 3 years
exports_latest <- subset(exports, y > 12 & y == fmax(y, list(c, s), "fill"), -y)
# How many sectors do we observe for each country in the last 3 years?
with(exports_latest, fndistinct(s, c))
# c1 c2 c3 c4 c5 c6 c7 c8
# 8 8 7 7 8 8 6 8
```

We can then compute the RCA index on this data

```
exports_latest |>
  mutate(RCA = fsum(v, c, TRA = "/") %/=% fsum(v, s, TRA = "/")) |>
  pivot("c", "RCA", "s", how = "wider", sort = TRUE)
# c s1 s2 s3 s4 s5 s6 s7 s8
# 1 c1 0.9038055 0.9073996 0.7608879 0.5752643 0.8558140 0.6619450 0.8820296 0.9617336
# 2 c2 1.1725178 1.1771805 0.9871092 0.7462973 1.1102578 0.8587493 1.1442677 1.2476687
# 3 c3 1.2072861 1.2120870 1.0163796 NA 1.1431799 0.8842135 1.1781982 1.2846653
# 4 c4 1.2438173 1.2487635 1.0471341 0.7916788 1.1777713 NA 1.2138493 1.3235380
# 5 c5 1.0014055 1.0053877 0.8430546 0.6373858 0.9482314 0.7334270 0.9772781 1.0655891
# 6 c6 1.0234618 1.0275317 0.8616232 0.6514245 0.9691166 0.7495810 0.9988030 1.0890591
# 7 c7 1.3447625 1.3501101 NA NA 1.2733564 0.9849009 1.3123624 1.4309531
# 8 c8 1.1226366 1.1271008 0.9451155 0.7145483 1.0630252 0.8222164 1.0955882 1.1945903
```

To summarise, *collapse* provides many options for ad-hoc or
limited grouping, which are faster than a full `fgroup_by()`,
and also syntactically efficient. Further efficiency gains are possible
using operations by reference, e.g., `%/=%` instead of
`/` to avoid an intermediate copy. It is also possible to
transform by reference using fast statistical functions by passing the
`set = TRUE` argument, e.g.,
`with(mtcars, fmean(mpg, cyl, TRA = "fill", set = TRUE))`
replaces `mpg` by its group-averaged version (the transformed
vector is returned invisibly).
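
A short sketch of both by-reference mechanisms follows. It works on freshly allocated vectors (`x` and a copy `mpg`, both illustrative names) so that the built-in `mtcars` data is not modified as a side effect.

```r
library(collapse)

# Division by reference: x itself is overwritten, no result vector is allocated
x <- c(2, 4, 6)
x %/=% 2
x

# Transformation by reference: mpg is overwritten with its group means
mpg <- mtcars$mpg + 0     # arithmetic allocates a fresh vector, detached from mtcars
fmean(mpg, mtcars$cyl, TRA = "fill", set = TRUE)
head(mpg)
```

Because these operations modify objects in place, any other name bound to the same vector sees the change as well, so they are best applied to objects you created yourself.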

*collapse* enhances R both statistically and computationally
and is a good option for *tidyverse* users searching for more
efficient and lightweight solutions to data manipulation and statistical
computing problems in R. For more information, I recommend starting with
the short vignette on *Documentation
Resources*.

R users willing to write efficient/lightweight code and completely
replace the *tidyverse* in their workflow are also encouraged to
closely examine the *fastverse*
suite of packages. *collapse* alone may not always suffice, but
99% of *tidyverse* code can be replaced with an efficient and
lightweight *fastverse* solution.
