In

`GRP.default()`

, the`"group.starts"`

attribute is always returned, even if there is only one group or every observation is its own group. Thanks @JamesThompsonC (#631).Fixed a bug in

`pivot()`

if`na.rm = TRUE`

and`how = "wider"|"recast"`

and there are multiple`value`

columns with different missingness patterns. In this case`na_omit(values)`

was applied with default settings to the original (long) value columns, implying potential loss of information. The fix applies`na_omit(values, prop = 1)`

, i.e., only removes completely missing rows.`qDF()/qDT()/qTBL()`

now allow a length-2 vector of names to`row.names.col`

if`X`

is a named atomic vector, e.g.,`qDF(fmean(mtcars), c("cars", "mean"))`

gives the same as`pivot(fmean(mtcars, drop = FALSE), names = list("car", "mean"))`

.Added a subsection on using internal (ad-hoc) grouping to the

*collapse*for*tidyverse*users vignette.`qsu()`

now adds a`WeightSum`

column giving the sum of (non-zero or missing) weights if the`w`

argument is used. Thanks @mayer79 for suggesting (#650). For panel data (`pid`

) the ‘Between’ sum of weights is also simply the number of groups, and the ‘Within’ sum of weights is the ‘Overall’ sum of weights divided by the number of groups.Fixed an inaccuracy in

`fquantile()/fnth()`

with weights: As per documentation the target sum is`sumwp = (sum(w) - min(w)) * p`

, however, in practice, the weight of the minimum element of`x`

was used instead of the minimum weight. Since the smallest element in the sample usually has a small weight this was unnoticed for a long while, but thanks to @Jahnic-kb now reported and fixed (#659).Fixed a bug in

`recode_char()`

when`regex = TRUE`

and the`default`

argument was used. Thanks @alinacherkas for both reporing and fixing (#654).

Fixes an installation bug on some Linux systems (conflicting types) (#613).

*collapse*now enforces string encoding in`fmatch()`

/`join()`

, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is internally coerced to UTF8 strings*before*the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons,*collapse*assumes that character vectors are uniform in terms of string encoding. Heterogeneous strings should be coerced using tools like`stringi::stri_trans_general(x, "latin-ascii")`

.Fixes a bug using qualified names for fast statistical functions inside

`across()`

(#621, thanks @alinacherkas).*collapse*now depends on R >= 3.4.0 due to the enforcement of`STRICT_R_HEADERS = 1`

from R v4.5.0. In particular R API functions were renamed`Calloc -> R_Calloc`

and`Free -> R_Free`

.

Some changes on the C-side to move the package closer to C API compliance (demanded by R-Core). One notable change is that

`gsplit()`

no longer supports S4 objects (because`SET_S4_OBJECT`

is not part of the API and`asS4()`

is too expensive for tight loops). I cannot think of a single example where it would be necessary to split an S4 object, but if you do have applications please file an issue.`pivot()`

has new arguments`FUN = "last"`

and`FUN.args = NULL`

, allowing wide and recast pivots with aggregation (default last value as before).`FUN`

currently supports a single function returning a scalar value.*Fast Statistical Functions*receive vectorized execution.`FUN.args`

can be used to supply a list of function arguments, including data-length arguments such as weights. There are also a couple of internal functions callable using function strings:`"first"`

,`"last"`

,`"count"`

,`"sum"`

,`"mean"`

,`"min"`

, or`"max"`

. These are built into the reshaping C-code and thus extremely fast. Thanks @AdrianAntico for the request (#582).`join()`

now provides enhanced verbosity, indicating the average order of the join between the two tables, e.g.`join(data.frame(id = c(1, 2, 2, 4)), data.frame(id = c(rep(1,4), 2:3))) #> left join: x[id] 3/4 (75%) <1.5:1st> y[id] 2/6 (33.3%) #> id #> 1 1 #> 2 2 #> 3 2 #> 4 4 join(data.frame(id = c(1, 2, 2, 4)), data.frame(id = c(rep(1,4), 2:3)), multiple = TRUE) #> left join: x[id] 3/4 (75%) <1.5:2.5> y[id] 5/6 (83.3%) #> id #> 1 1 #> 2 1 #> 3 1 #> 4 1 #> 5 2 #> 6 2 #> 7 4`

In

`collap()`

, with multiple functions passed to`FUN`

or`catFUN`

and`return = "long"`

, the`"Function"`

column is now generated as a factor variable instead of character (which is more efficient).

Updated ‘

*collapse*and*sf*’ vignette to reflect the recent support for*units*objects, and added a few more examples.Fixed a bug in

`join()`

where a full join silently became a left join if there are no matches between the tables (#574). Thanks @D3SL for reporting.Added function

`group_by_vars()`

: A standard evaluation version of`fgroup_by()`

that is slimmer and safer for programming, e.g.`data |> group_by_vars(ind1) |> collapg(custom = list(fmean = ind2, fsum = ind3))`

. Or, using*magrittr*:`library(magrittr) set_collapse(mask = "manip") # for fgroup_vars -> group_vars %>% data group_by_vars(ind1) %>% { add_vars( group_vars(., "unique"), get_vars(., ind2) %>% fmean(keep.g = FALSE) %>% add_stub("mean_"), get_vars(., ind3) %>% fsum(keep.g = FALSE) %>% add_stub("sum_") ) }`

Added function

`as_integer_factor()`

to turn factors/factor columns into integer vectors.`as_numeric_factor()`

already exists, but is memory inefficient for most factors where levels can be integers.`join()`

now internally checks if the rows of the joined datasets match exactly. This check, using`identical(m, seq_row(y))`

, is inexpensive, but, if`TRUE`

, saves a full subset and deep copy of`y`

. Thus`join()`

now inherits the intelligence already present in functions like`fsubset()`

,`roworder()`

and`funique()`

- a key for efficient data manipulation is simply doing less.In

`join()`

, if`attr = TRUE`

, the`count`

option to`fmatch()`

is always invoked, so that the attribute attached always has the same form, regardless of`verbose`

or`validate`

settings.`roworder[v]()`

has optional setting`verbose = 2L`

to indicate if`x`

is already sorted, making the call to`roworder[v]()`

redundant.

*collapse*now explicitly supports*xts*/*zoo*and*units*objects and concurrently removes an additional check in the`.default`

method of statistical functions that called the matrix method if`is.matrix(x) && !inherits(x, "matrix")`

. This was a smart solution to account for the fact that*xts*objects are matrix-based but don’t inherit the`"matrix"`

class, thus wrongly calling the default method. The same is the case for*units*, but here, my recent more intensive engagement with spatial data convinced me that this should be changed. For one, under the previous heuristic solution, it was not possible to call the default method on a*units*matrix, e.g.,`fmean.default(st_distance(points_sf))`

called`fmean.matrix()`

and yielded a vector. This should not be the case. Secondly, aggregation e.g.`fmean(st_distance(points_sf))`

or`fmean(st_distance(points_sf), g = group_vec)`

yielded a plain numeric object that lost the*units*class (in line with the general attribute handling principles). Therefore, I have now decided to remove the heuristic check within the default methods, and explicitly support*zoo*and*units*objects. For*Fast Statistical Functions*, the methods are`FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...)`

and`FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...)`

. While the behavior for*xts*/*zoo*remains the same, the behavior for*units*is enhanced, as now the class is preserved in aggregations (the`.default`

method preserves attributes except for*ts*), and it is possible to manually invoke the`.default`

method on a units matrix and obtain an aggregate statistic. This change may impact computations on other matrix based classes which don’t inherit from`"matrix"`

(*mts*does inherit from`"matrix"`

, and I am not aware of any other affected classes, but user code like`m <- matrix(rnorm(25), 5); class(m) <- "bla"; fmean(m)`

will now yield a scalar instead of a vector. Such code must be adjusted to either`class(m) <- c("bla", "matrix")`

or`fmean.matrix(m)`

). Overall, the change makes*collapse*behave in a more standard and predictable way, and enhances its support for*units*objects central in the*sf*ecosystem.`fquantile()`

now also preserves the attributes of the input, in line with`quantile()`

.

- Fixes some issues with signed int overflows inside hash functions and possible protect bugs flagged by RCHK.

An

**article**on*collapse*has been submitted to the Journal of Statistical Software. The preprint is available through*arXiv*.Removed

*magrittr*from most documentation examples (using base pipe).Improved

`plot.GRP`

a little bit - on request of JSS editors.

Fixed a bug in

`fmatch()`

when matching integer vectors to factors. This also affected`join()`

.Improved cross-platform compatibility of OpenMP flags. Thanks @kalibera.

Added

`stub = TRUE`

argument to the*grouped_df*methods of*Fast Statistical Functions*supporting weights, to be able to remove or alter prefixes given to aggregated weights columns if`keep.w = TRUE`

. Globally, users can set`st_collapse(stub = FALSE)`

to disable this prefixing in all statistical functions and operators.

Added functions

`na_locf()`

and`na_focb()`

for fast basic C implementations of these procedures (optionally by reference).`replace_na()`

now also has a`type`

argument which supports options`"locf"`

and`"focb"`

(default`"const"`

), similar to`data.table::nafill`

. The implementation also supports character data and list-columns (`NULL/empty`

elements). Thanks @BenoitLondon for suggesting (#489). I note that`na_locf()`

exists in some other packages (such as*imputeTS*) where it is implemented in R and has additional options. Users should utilize the flexible namespace i.e.`set_collapse(remove = "na_locf")`

to deal with this.Fixed a bug in weighted quantile estimation (

`fquantile()`

) that could lead to wrong/out-of-range estimates in some cases. Thanks @zander-prinsloo for reporting (#523).Improved right join such that join column names of

`x`

instead of`y`

are preserved. This is more consistent with the other joins when join columns in`x`

and`y`

have different names.More fluent and safe interplay of ‘mask’ and ‘remove’ options in

`set_collapse()`

: it is now seamlessly possible to switch from any combination of ‘mask’ and ‘remove’ to any other combination without the need of setting them to`NULL`

first.

In

`pivot(..., values = [multiple columns], labels = "new_labels_column", how = "wieder")`

, if the columns selected through`values`

already have variable labels, they are concatenated with the new labels provided through`"new_labels_col"`

using`" - "`

as a separator (similar to`names`

where the separator is`"_"`

).`whichv()`

and operators`%==%`

,`%!=%`

now properly account for missing double values, e.g.`c(NA_real_, 1) %==% c(NA_real_, 1)`

yields`c(1, 2)`

rather than`2`

. Thanks @eutwt for flagging this (#518).In

`setv(X, v, R)`

, if the type of`R`

is greater than`X`

e.g.`setv(1:10, 1:3, 9.5)`

, then a warning is issued that conversion of`R`

to the lower type (real to integer in this case) may incur loss of information. Thanks @tony-aw for suggesting (#498).`frange()`

has an option`finite = FALSE`

, like`base::range`

. Thanks @MLopez-Ibanez for suggesting (#511).`varying.pdata.frame(..., any_group = FALSE)`

now unindexes the result (as should be the case).

Fixed bug in full join if

`verbose = 0`

. Thanks @zander-prinsloo for reporting.Added argument

`multiple = FALSE`

to`join()`

. Setting`multiple = TRUE`

performs a multiple-matching join where a row in`x`

is matched to all matching rows in`y`

. The default`FALSE`

just takes the first matching row in`y`

.Improved recode/replace functions. Notably,

`replace_outliers()`

now supports option`value = "clip"`

to replace outliers with the respective upper/lower bounds, and also has option`single.limit = "mad"`

which removes outliers exceeding a certain number of median absolute deviations. Furthermore, all functions now have a`set`

argument which fully applies the transformations by reference.Functions

`replace_NA`

and`replace_Inf`

were renamed to`replace_na`

and`replace_inf`

to make the namespace a bit more consistent. The earlier versions remain available.

Fixed a serious bug in

`qsu()`

where higher order weighted statistics were erroneous, i.e. whenever`qsu(x, ..., w = weights, higher = TRUE)`

was invoked, the ‘SD’, ‘Skew’ and ‘Kurt’ columns were wrong (if`higher = FALSE`

the weighted ‘SD’ is correct). The reason is that there appears to be no straightforward generalization of Welford’s Online Algorithm to higher-order weighted statistics. This was not detected earlier because the algorithm was only tested with unit weights. The fix involved replacing Welford’s Algorithm for the higher-order weighted case by a 2-pass method, that additionally uses long doubles for higher-order terms. Thanks @randrescastaneda for reporting.Fixed some unexpected behavior in

`t_list()`

where names ‘V1’, ‘V2’, etc. were assigned to unnamed inner lists. It now preserves the missing names. Thanks @orgadish for flagging this.

In

`join`

, the if`y`

is an expression e.g.`join(x = mtcars, y = subset(mtcars, mpg > 20))`

, then its name is not extracted but just set to`"y"`

. Before, the name of`y`

would be captured as`as.character(substitute(y))[1] = "subset"`

in this case. This is an improvement mainly for display purposes, but could also affect code if there are duplicate columns in both datasets and`suffix`

was not provided in the`join`

call: before, y-columns would be renamed using a (non-sensible)`"_subset"`

suffix, but now using a`"_y"`

suffix. Note that this only concerns cases where`y`

is an expression rather than a single object.Small performance improvements to

`%[!]in%`

operators:`%!in%`

now uses`is.na(fmatch(x, table))`

rather than`fmatch(x, table, 0L) == 0L`

, and`%in%`

, if exported using`set_collapse(mask = "%in%"|"special"|"all")`

is`as.logical(fmatch(x, table, 0L))`

instead of`fmatch(x, table, 0L) > 0L`

. The latter are faster because comparison operators`>`

,`==`

with integers additionally need to check for`NA`

’s (= the smallest integer in C).

In

`fnth()/fquantile()`

, there has been a slight change to the weighted quantile algorithm. As outlined in the documentation, this algorithm gives weighted versions for all continuous quantile methods (type 7-9) in R by replacing sample quantities with their weighted counterparts. E.g., for the default quantile type 7, the continuous (lower) target element is`(n - 1) * p`

. In the weighted algorithm, this became`(sum(w) - mean(w)) * p`

and was compared to the cumulative sum of ordered (by`x`

) weights, to preserve equivalence of the algorithms in cases where the weights are all equal. However, upon a second thought, the use of`mean(w)`

does not really reflect a standard interpretation of the weights as frequencies. I have reasoned that using`min(w)`

instead of`mean(w)`

better reflects such an interpretation, as the minimum (non-zero) weight reflects the size of the smallest sampled unit. So the weighted quantile type 7 target is now`(sum(w) - min(w)) * p`

, and also the other methods have been adjusted accordingly (note that zero weight observations are ignored in the algorithm).This is more a

*Note*than a change to the package: there is an issue with*vctrs*that users can encounter using*collapse*together with the*tidyverse*(especially*ggplot2*), which is that*collapse*internally optimizes computations on factors by giving them an additional`"na.included"`

class if they are known to not contain any missing values. For example`pivot(mtcars)`

gives a`"variable"`

factor which has class`c("factor", "na.included")`

, such that grouping on`"variable"`

in subsequent operations is faster. Unfortunately,`pivot(mtcars) |> ggplot(aes(y = value)) + geom_histogram() + facet_wrap( ~ variable)`

currently gives an error produced by*vctrs*, because*vctrs*does not implement a standard S3 method dispatch and thus does not ignore the`"na.included"`

class. It turns out that the only way for me to deal with this is would be to swap the order of classes i.e.`c("na.included", "factor")`

, import*vctrs*, and implement`vec_ptype2`

and`vec_cast`

methods for`"na.included"`

objects. This will never happen, as*collapse*is and will remain independent of the*tidyverse*. There are two ways you can deal with this: The first way is to remove the`"na.included"`

class for*ggplot2*e.g.`facet_wrap( ~ set_class(variable, "factor"))`

or`facet_wrap( ~ factor(variable))`

will both work. The second option is to define a function`vec_ptype2.factor.factor <- function(x, y, ...) x`

in your global environment, which avoids*vctrs*performing extra checks on factor objects.

Fixed a signed integer overflow inside a hash function detected by CRAN checks (changing to unsigned int).

Updated the cheatsheet (see README.md).

Added global option ‘stub’ (default

`TRUE`

) to`set_collapse`

. It is passed to the`stub(s)`

arguments of the statistical operators,`B`

,`W`

,`STD`

,`HDW`

,`HDW`

,`L`

,`D`

,`Dlog`

,`G`

(in`.OPERATOR_FUN`

). By default these operators add a prefix/stub to transformed matrix or data.frame columns. Setting`set_collapse(stub = FALSE)`

now allows to switch off this behavior such that columns are not prepended with a prefix (by default).`roworder[v]()`

now also supports grouped data frames, but prints a message indicating that this is inefficient (also for indexed data). An additional argument`verbose`

can be set to`0`

to avoid such messages.

`%in%`

with`set_collapse(mask = "%in%")`

does not warn anymore about overidentification when used with data frames (i.e. using`overid = 2`

in`fmatch()`

).Fixed several typos in the documentation.

*collapse* 2.0, released in Mid-October 2023, introduces fast
table joins and data reshaping capabilities alongside other convenience
functions, and enhances the packages global configurability, including
interactive namespace control.

- In a grouped setting, if
`.data`

is used inside`fsummarise()`

and`fmutate()`

, and`.cols = NULL`

,`.data`

will contain all columns except for grouping columns (in-line with the`.SD`

syntax of*data.table*). Before,`.data`

contained all columns. The selection in`.cols`

still refers to all columns, thus it is still possible to select all columns using e.g.`grouped_data %>% fsummarise(some_expression_involving(.data), .cols = seq_col(.))`

.

- In
`qsu()`

, argument`vlabels`

was renamed to`labels`

. But`vlabels`

will continue to work.

- Fixed a bug in the integer methods of
`fsum()`

,`fmean()`

and`fprod()`

that returned`NA`

if and only if there was a single integer followed by`NA`

’s e.g`fsum(c(1L, NA, NA))`

erroneously gave`NA`

. This was caused by a C-level shortcut that returned`NA`

when the first element of the vector had been reached (moving from back to front) without encountering any non-NA-values. The bug consisted in the content of the first element not being evaluated in this case. Note that this bug did not occur with real numbers, and also not in grouped execution. Thanks @blset for reporting (#432).

Added

`join()`

: class-agnostic, vectorized, and (default) verbose joins for R, modeled after the*polars*API. Two different join algorithms are implemented: a hash-join (default, if`sort = FALSE`

) and a sort-merge-join (if`sort = TRUE`

).Added

`pivot()`

: fast and easy data reshaping! It supports longer, wider and recast pivoting, including handling of variable labels, through a uniform and parsimonious API. It does not perform data aggregation, and by default does not check if the data is uniquely identified by the supplied ids. Underidentification for ‘wide’ and ‘recast’ pivots results in the last value being taken within each group. Users can toggle a duplicates check by setting`check.dups = TRUE`

.Added

`rowbind()`

: a fast class-agnostic alternative to`rbind.data.frame()`

and`data.table::rbindlist()`

.Added

`fmatch()`

: a fast`match()`

function for vectors and data frames/lists. It is the workhorse function of`join()`

, and also benefits`ckmatch()`

,`%!in%`

, and new operators`%iin%`

and`%!iin%`

(see below). It is also possible to`set_collapse(mask = "%in%")`

to replace`base::"%in%"`

using`fmatch()`

. Thanks to`fmatch()`

, these operators also all support data frames/lists of vectors, which are compared row-wise.Added operators

`%iin%`

and`%!iin%`

: these directly return indices, i.e.`%[!]iin%`

is equivalent to`which(x %[!]in% table)`

. This is useful especially for subsetting where directly supplying indices is more efficient e.g.`x[x %[!]iin% table]`

is faster than`x[x %[!]in% table]`

. Similarly`fsubset(wlddev, iso3c %iin% c("DEU", "ITA", "FRA"))`

is very fast.Added

`vec()`

: efficiently turn matrices or data frames / lists into a single atomic vector. I am aware of multiple implementations in other packages, which are mostly inefficient. With atomic objects,`vec()`

simply removes the attributes without copying the object, and with lists it directly calls`C_pivot_longer`

.

`set_collapse()`

now supports options ‘mask’ and ‘remove’, giving*collapse*a flexible namespace in the broadest sense that can be changed at any point within the active session:‘mask’ supports base R or

*dplyr*functions that can be masked into the faster*collapse*versions. E.g.`library(collapse); set_collapse(mask = "unique")`

(or, equivalently,`set_collapse(mask = "funique")`

) will create`unique <- funique`

in the*collapse*namespace, export`unique()`

from the namespace, and detach and attach the namespace again so R can find it. The re-attaching also ensures that*collapse*comes right after the global environment, implying that all it’s functions will take priority over other libraries. Users can use`fastverse::fastverse_conflicts()`

to check which functions are masked after using`set_collapse(mask = ...)`

. The option can be changed at any time. Using`set_collapse(mask = NULL)`

removes all masked functions from the namespace, and can also be called simply to ensure*collapse*is at the top of the search path.‘remove’ allows removing arbitrary functions from the

*collapse*namespace. E.g.`set_collapse(remove = "D")`

will remove the difference operator`D()`

, which also exists in*stats*to calculate symbolic and algorithmic derivatives (this is a convenient example but not necessary since`collapse::D`

is S3 generic and will call`stats::D()`

on R calls, expressions or names). This is safe to do as it only modifies which objects are exported from the namespace (it does not truly remove objects from the namespace). This option can also be changed at any time.`set_collapse(remove = NULL)`

will restore the exported namespace.

For both options there exist a number of convenient keywords to bulk-mask / remove functions. For example

`set_collapse(mask = "manip", remove = "shorthand")`

will mask all data manipulation functions such as`mutate <- fmutate`

and remove all function shorthands such as`mtt`

(i.e. abbreviations for frequently used functions that*collapse*supplies for faster coding / prototyping).

`set_collapse()`

also supports options ‘digits’, ‘verbose’ and ‘stable.algo’, enhancing the global configurability of*collapse*.`qM()`

now also has a`row.names.col`

argument in the second position allowing generation of rownames when converting data frame-like objects to matrix e.g.`qM(iris, "Species")`

or`qM(GGDC10S, 1:5)`

(interaction of id’s).`as_factor_GRP()`

and`finteraction()`

now have an argument`sep = "."`

denoting the separator used for compound factor labels.`alloc()`

now has an additional argument`simplify = TRUE`

.`FALSE`

always returns list output.`frename()`

supports both`new = old`

(*pandas*, used to far) and`old = new`

(*dplyr*) style renaming conventions.`across()`

supports negative indices, also in grouped settings: these will select all variables apart from grouping variables.`TRA()`

allows shorthands`"NA"`

for`"replace_NA"`

and`"fill"`

for`"replace_fill"`

.`group()`

experienced a minor speedup with >= 2 vectors as the first two vectors are now hashed jointly.`fquantile()`

with`names = TRUE`

adds up to 1 digit after the comma in the percent-names, e.g.`fquantile(airmiles, probs = 0.001)`

generates appropriate names (not 0% as in the previous version).

New vignette on

*collapse*’s Handling of R Objects: provides an overview of collapse’s (internal) class-agnostic R programming framework.`print.descr()`

with groups and option`perc = TRUE`

(the default) also shows percentages of the group frequencies for each variable.`funique(mtcars[NULL, ], sort = TRUE)`

gave an error (for data frame with zero rows). Thanks @NicChr (#406).Added SIMD vectorization for

`fsubset()`

.`vlengths()`

now also works for strings, and is hence a much faster version of both`lengths()`

and`nchar()`

. Also for atomic vectors the behavior is like`lengths()`

, e.g.`vlengths(rnorm(10))`

gives`rep(1L, 10)`

.In

`collap[v/g]()`

, the`...`

argument is now placed after the`custom`

argument instead of after the last argument, in order to better guard against unwanted partial argument matching. In particular, previously the`n`

argument passed to`fnth`

was partially matched to`na.last`

. Thanks @ummel for alerting me of this (#421).

Using

`DATAPTR_RO`

to point to R lists because of the use of`ALTLISTS`

on R-devel.Replacing

`!=`

loop controls for SIMD loops with`<`

to ensure compatibility on all platforms. Thanks @albertus82 (#399).

Improvements in

`get_elem()/has_elem()`

: Option`invert = TRUE`

is implemented more robustly, and a function passed to`get_elem()/has_elem()`

is now applied to all elements in the list, including elements that are themselves list-like. This enables the use of`inherits`

to find list-like objects inside a broader list structure e.g.`get_elem(l, inherits, what = "lm")`

fetches all linear model objects inside`l`

.Fixed a small bug in

`descr()`

introduced in v1.9.0, producing an error if a data frame contained no numeric columns - because an internal function was not defined in that case. Also, POSIXct columns are handled better in print - preserving the time zone (thanks @cdignam-chwy #392).`fmean()`

and`fsum()`

with`g = NULL`

, as well as`TRA()`

,`setop()`

, and related operators`%r+%`

,`%+=%`

etc.,`setv()`

and`fdist()`

now utilize Single Instruction Multiple Data (SIMD) vectorization by default (if OpenMP is enabled), enabling potentially very fast computing speeds. Whether these instructions are utilized during compilation depends on your system. In general, if you want to max out*collapse*on your system, consider compiling from source with`CFLAGS += -O3 -march=native -fopenmp`

and`CXXFLAGS += -O3 -march=native`

in your`.R/Makevars`

.

Added functions

`fduplicated()`

and`any_duplicated()`

, for vectors and lists / data frames. Thanks @NicChr (#373)`sort`

option added to`set_collapse()`

to be able to set unordered grouping as a default. E.g. setting`set_collapse(sort = FALSE)`

will affect`collap()`

,`BY()`

,`GRP()`

,`fgroup_by()`

,`qF()`

,`qG()`

,`finteraction()`

,`qtab()`

and internal use of these functions for ad-hoc grouping in fast statistical functions. Other uses of`sort`

, for example in`funique()`

where the default is`sort = FALSE`

, are not affected by the global default setting.Fixed a small bug in

`group()`

/`funique()`

resulting in an unnecessary memory allocation error in rare cases. Thanks @NicChr (#381).

Further fix to an Address Sanitizer issue as required by CRAN (eliminating an unused out of bounds access at the end of a loop).

`qsu()`

finally has a grouped_df method.Added options

`option("collapse_nthreads")`

and`option("collapse_na.rm")`

, which allow you to load*collapse*with different defaults e.g. through an`.Rprofile`

or`.fastverse`

configuration file. Once*collapse*is loaded, these options take no effect, and users need to use`set_collapse()`

to change`.op[["nthreads"]]`

and`.op[["na.rm"]]`

interactively.Exported method

`plot.psmat()`

(can be useful to plot time series matrices).

Fixed minor C/C++ issues flagged by CRAN’s detailed checks.

Added functions

`set_collapse()`

and`get_collapse()`

, allowing you to globally set defaults for the`nthreads`

and`na.rm`

arguments to all functions in the package. E.g.`set_collapse(nthreads = 4, na.rm = FALSE)`

could be a suitable setting for larger data without missing values. This is implemented using an internal environment by the name of`.op`

, such that these defaults are received using e.g.`.op[["nthreads"]]`

, at the computational cost of a few nanoseconds (8-10x faster than`getOption("nthreads")`

which would take about 1 microsecond).`.op`

is not accessible by the user, so function`get_collapse()`

can be used to retrieve settings. Exempt from this are functions`.quantile`

, and a new function`.range`

(alias of`frange`

), which go directly to C for maximum performance in repeated executions, and are not affected by these global settings. Function`descr()`

, which internally calls a bunch of statistical functions, is also not affected by these settings.Further improvements in thread safety for

`fsum()`

and`fmean()`

in grouped computations across data frame columns. All OpenMP enabled functions in*collapse*can now be considered thread safe i.e. they pass the full battery of tests in multithreaded mode.

*collapse* 1.9.0 released mid of January 2023, provides
improvements in performance and versatility in many areas, as well as
greater statistical capabilities, most notably efficient (grouped,
weighted) estimation of sample quantiles.

All functions renamed in

*collapse*1.6.0 are now depreciated, to be removed end of 2023. These functions had already been giving messages since v1.6.0. See`help("collapse-renamed")`

.The lead operator

`F()`

is not exported anymore from the package namespace, to avoid clashes with`base::F`

flagged by multiple people. The operator is still part of the package and can be accessed using`collapse:::F`

. I have also added an option`"collapse_export_F"`

, such that setting`options(collapse_export_F = TRUE)`

before loading the package exports the operator as before. Thanks @matthewross07 (#100), @edrubin (#194), and @arthurgailes (#347).Function

`fnth()`

has a new default`ties = "q7"`

, which gives the same result as`quantile(..., type = 7)`

(R’s default). More details below.

`fmode()`

gave wrong results for singleton groups (groups of size 1) on*unsorted*data. I had optimized`fmode()`

for singleton groups to directly return the corresponding element, but it did not access the element through the (internal) ordering vector, so the first element/row of the entire vector/data was taken. The same mistake occurred for`fndistinct`

if singleton groups were`NA`

, which were counted as`1`

instead of`0`

under the`na.rm = TRUE`

default (provided the first element of the vector/data was not`NA`

). The mistake did not occur with data sorted by the groups, because here the data pointer already pointed to the first element of the group. (My apologies for this bug, it took me more than half a year to discover it, using*collapse*on a daily basis, and it escaped 700 unit tests as well).Function

`groupid(x, na.skip = TRUE)`

returned uninitialized first elements if the first values in`x`

where`NA`

. Thanks for reporting @Henrik-P (#335).Fixed a bug in the

`.names`

argument to`across()`

. Passing a naming function such as`.names = function(c, f) paste0(c, "-", f)`

now works as intended i.e. the function is applied to all combinations of columns (c) and functions (f) using`outer()`

. Previously this was just internally evaluated as`.names(cols, funs)`

, which did not work if there were multiple cols and multiple funs. There is also now a possibility to set`.names = "flip"`

, which names columns`f_c`

instead of`c_f`

.`fnrow()`

was rewritten in C and also supports data frames with 0 columns. Similarly for`seq_row()`

. Thanks @NicChr (#344).

Added functions

`fcount()`

and`fcountv()`

: a versatile and blazing fast alternative to`dplyr::count`

. It also works with vectors, matrices, as well as grouped and indexed data.Added function

`fquantile()`

: Fast (weighted) continuous quantile estimation (methods 5-9 following Hyndman and Fan (1996)), implemented fully in C based on quickselect and radixsort algorithms, and also supports an ordering vector as optional input to speed up the process. It is up to 2x faster than`stats::quantile`

on larger vectors, but also especially fast on smaller data, where the R overhead of`stats::quantile`

becomes burdensome. For maximum performance during repeated executions, a programmers version`.quantile()`

with different defaults is also provided.Added function

`fdist()`

: A fast and versatile replacement for`stats::dist`

. It computes a full euclidean distance matrix around 4x faster than`stats::dist`

in serial mode, with additional gains possible through multithreading along the distance matrix columns (decreasing thread loads as the matrix is lower triangular). It also supports computing the distance of a matrix with a single row-vector, or simply between two vectors. E.g.`fdist(mat, mat[1, ])`

is the same as`sqrt(colSums((t(mat) - mat[1, ])^2)))`

, but about 20x faster in serial mode, and`fdist(x, y)`

is the same as`sqrt(sum((x-y)^2))`

, about 3x faster in serial mode. In both cases (sub-column level) multithreading is available.*Note*that`fdist`

does not skip missing values i.e.`NA`

’s will result in`NA`

distances. There is also no internal implementation for integers or data frames. Such inputs will be coerced to numeric matrices.Added function

`GRPid()`

to easily fetch the group id from a grouping object, especially inside grouped`fmutate()`

calls. This addition was warranted especially by the new improved`fnth.default()`

method which allows orderings to be supplied for performance improvements. See commends on`fnth()`

and the example provided below.`fsummarize()`

was added as a synonym to`fsummarise`

. Thanks @arthurgailes for the PR.**C API**:*collapse*exports around 40 C functions that provide functionality that is either convenient or rather complicated to implement from scratch. The exported functions can be found at the bottom of`src/ExportSymbols.c`

. The API does not include the*Fast Statistical Functions*, which I thought are too closely related to how*collapse*works internally to be of much use to a C programmer (e.g. they expect grouping objects or certain kinds of integer vectors). But you are free to request the export of additional functions, including C++ functions.

`fnth()`

and`fmedian()`

were rewritten in C, with significant gains in performance and versatility. Notably,`fnth()`

now supports (grouped, weighted) continuous quantile estimation like`fquantile()`

(`fmedian()`

, which is a wrapper around`fnth()`

, can also estimate various quantile based weighted medians). The new default for`fnth()`

is`ties = "q7"`

, which gives the same result as`(f)quantile(..., type = 7)`

(R’s default). OpenMP multithreading across groups is also much more effective in both the weighted and unweighted case. Finally,`fnth.default`

gained an additional argument`o`

to pass an ordering vector, which can dramatically speed up repeated invocations of the function on the dame data:`# Estimating multiple weighted-grouped quantiles on mpg: pre-computing an ordering provides extra speed. %>% fgroup_by(cyl, vs, am) %>% mtcars fmutate(o = radixorder(GRPid(), mpg)) %>% # On grouped data, need to account for GRPid() fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o, w = wt), mpg_median = fmedian(mpg, o = o, w = wt), mpg_Q3 = fnth(mpg, 0.75, o = o, w = wt)) # Note that without weights this is not always faster. Quickselect can be very efficient, so it depends # on the data, the number of groups, whether they are sorted (which speeds up radixorder), etc...`

`BY`

now supports data-length arguments to be passed e.g.`BY(mtcars, mtcars$cyl, fquantile, w = mtcars$wt)`

, making it effectively a generic grouped`mapply`

function as well. Furthermore, the grouped_df method now also expands grouping columns for output length > 1.`collap()`

, which internally uses`BY`

with non-*Fast Statistical Functions*, now also supports arbitrary further arguments passed down to functions to be split by groups. Thus users can also apply custom weighted functions with`collap()`

. Furthermore, the parsing of the`FUN`

,`catFUN`

and`wFUN`

arguments was improved and brought in-line with the parsing of`.fns`

in`across()`

. The main benefit of this is that*Fast Statistical Functions*are now also detected and optimizations carried out when passed in a list providing a new name e.g.`collap(data, ~ id, list(mean = fmean))`

is now optimized! Thanks @ttrodrigz (#358) for requesting this.`descr()`

, by virtue of`fquantile`

and the improvements to`BY`

, supports full-blown grouped and weighted descriptions of data. This is implemented through additional`by`

and`w`

arguments. The function has also been turned into an S3 generic, with a default and a ‘grouped_df’ method. The ‘descr’ methods`as.data.frame`

and`print`

also feature various improvements, and a new`compact`

argument to`print.descr`

, allowing a more compact printout. Users will also notice improved performance, mainly due to`fquantile`

: on the M1`descr(wlddev)`

is now 2x faster than`summary(wlddev)`

, and 41x faster than`Hmisc::describe(wlddev)`

. Thanks @statzhero for the request (#355).`radixorder`

is about 25% faster on characters and doubles. This also benefits grouping performance. Note that`group()`

may still be substantially faster on unsorted data, so if performance is critical try the`sort = FALSE`

argument to functions like`fgroup_by`

and compare.Most list processing functions are noticeably faster, as checking the data types of elements in a list is now also done in C, and I have made some improvements to

*collapse*’s version of`rbindlist()`

(used in`unlist2d()`

, and various other places).`fsummarise`

and`fmutate`

gained an ability to evaluate arbitrary expressions that result in lists / data frames without the need to use`across()`

. For example:`mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE))`

or`mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb)), names = TRUE))`

. There is also the possibility to compute expressions using`.data`

e.g.`mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb, .data)), names = TRUE))`

yields the same thing, but is less efficient because the whole dataset (including ‘cyl’) is split by groups. For greater efficiency and convenience, you can pre-select columns using a global`.cols`

argument, e.g.`mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(.data), names = TRUE), .cols = .c(mpg, wt, carb))`

gives the same as above. Three*Notes*about this:- No grouped vectorizations for fast statistical functions i.e. the
entire expression is evaluated for each group. (Let me know if there are
applications where vectorization would be possible and beneficial. I
can’t think of any.)

- All elements in the result list need to have the same length, or,
for
`fmutate`

, have the same length as the data (in each group). - If
`.data`

is used, the entire expression (`expr`

) will be turned into a function of`.data`

(`function(.data) expr`

), which means columns are only available when accessed through`.data`

e.g.`.data$col1`

.

- No grouped vectorizations for fast statistical functions i.e. the
entire expression is evaluated for each group. (Let me know if there are
applications where vectorization would be possible and beneficial. I
can’t think of any.)
`fsummarise`

supports computations with mixed result lengths e.g.`mtcars |> fgroup_by(cyl) |> fsummarise(N = GRPN(), mean_mpg = fmean(mpg), quantile_mpg = fquantile(mpg))`

, as long as all computations result in either length 1 or length k vectors, where k is the maximum result length (e.g. for`fquantile`

with default settings k = 5).List extraction function

`get_elem()`

now has an option`invert = TRUE`

(default`FALSE`

) to remove matching elements from a (nested) list. Also the functionality of argument`keep.class = TRUE`

is implemented in a better way, such that the default`keep.class = FALSE`

toggles classes from (non-matched) list-like objects inside the list to be removed.`num_vars()`

has become a bit smarter: columns of class ‘ts’ and ‘units’ are now also recognized as numeric. In general, users should be aware that`num_vars()`

does not regard any R methods defined for`is.numeric()`

, it is implemented in C and simply checks whether objects are of type integer or double, and do not have a class. The addition of these two exceptions now guards against two common cases where`num_vars()`

may give undesirable outcomes. Note that`num_vars()`

is also called in`collap()`

to distinguish between numeric (`FUN`

) and non-numeric (`catFUN`

) columns.Improvements to

`setv()`

and`copyv()`

, making them more robust to borderline cases:`integer(0)`

passed to`v`

does nothing (instead of error), and it is also possible to pass a single real index if`vind1 = TRUE`

i.e. passing`1`

instead of`1L`

does not produce an error.`alloc()`

now works with all types of objects i.e. it can replicate any object. If the input is non-atomic, atomic with length > 1 or`NULL`

, the output is a list of these objects, e.g.`alloc(NULL, 10)`

gives a length 10 list of`NULL`

objects, or`alloc(mtcars, 10)`

gives a list of`mtcars`

datasets. Note that in the latter case the datasets are not deep-copied, so no additional memory is consumed.`missing_cases()`

and`na_omit()`

have gained an argument`prop = 0`

, indicating the proportion of values missing for the case to be considered missing/to be omitted. The default value of`0`

indicates that at least 1 value must be missing. Of course setting`prop = 1`

indicates that all values must be missing. For data frames/lists the checking is done efficiently in C. For matrices this is currently still implemented using`rowSums(is.na(X)) >= max(as.integer(prop * ncol(X)), 1L)`

, so the performance is less than optimal.`missing_cases()`

has an extra argument`count = FALSE`

. Setting`count = TRUE`

returns the case-wise missing value count (by`cols`

).Functions

`frename()`

and`setrename()`

have an additional argument`.nse = TRUE`

, conforming to the default non-standard evaluation of tagged vector expressions e.g.`frename(mtcars, mpg = newname)`

is the same as`frename(mtcars, mpg = "newname")`

. Setting`.nse = FALSE`

allows`newname`

to be a variable holding a name e.g.`newname = "othername"; frename(mtcars, mpg = newname, .nse = FALSE)`

. Another use of the argument is that a (named) character vector can now be passed to the function to rename a (subset of) columns e.g.`cvec = letters[1:3]; frename(mtcars, cvec, cols = 4:6, .nse = FALSE)`

(this works even with`.nse = TRUE`

), and`names(cvec) = c("cyl", "vs", "am"); frename(mtcars, cvec, .nse = FALSE)`

. Furthermore,`setrename()`

now also returns the renamed data invisibly, and`relabel()`

and`setrelabel()`

have also gained similar flexibility to allow (named) lists or vectors of variable labels to be passed.*Note*that these function have no NSE capabilities, so they work essentially like`frename(..., .nse = FALSE)`

.Function

`add_vars()`

became a bit more flexible and also allows single vectors to be added with tags e.g.`add_vars(mtcars, log_mpg = log(mtcars$mpg), STD(mtcars))`

, similar to`cbind`

. However`add_vars()`

continues to not replicate length 1 inputs.Safer multithreading: OpenMP multithreading over parts of the R API is minimized, reducing errors that occurred especially when multithreading across data frame columns. Also the number of threads supplied by the user to all OpenMP enabled functions is ensured to not exceed either of

`omp_get_num_procs()`

,`omp_get_thread_limit()`

, and`omp_get_max_threads()`

.

Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).

`.pseries`

/`.indexed_series`

methods also change the implicit class of the vector (attached after`"pseries"`

), if the data type changed. e.g. calling a function like`fgrowth`

on an integer pseries changed the data type to double, but the “integer” class was still attached after “pseries”.Fixed bad testing for SE inputs in

`fgroup_by()`

and`findex_by()`

. See #320.Added

`rsplit.matrix`

method.`descr()`

now by default also reports 10% and 90% quantiles for numeric variables (in line with STATA’s detailed summary statistics), and can also be applied to ‘pseries’ / ‘indexed_series’. Furthermore,`descr()`

itself now has an argument`stepwise`

such that`descr(big_data, stepwise = TRUE)`

yields computation of summary statistics on a variable-by-variable basis (and the finished ‘descr’ object is returned invisibly). The printed result is thus identical to`print(descr(big_data), stepwise = TRUE)`

, with the difference that the latter first does the entire computation whereas the former computes statistics on demand.

Function

`ss()`

has a new argument`check = TRUE`

. Setting`check = FALSE`

allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.Function

`get_vars()`

has a new argument`rename`

allowing select-renaming of columns in standard evaluation programming, e.g.`get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE)`

. The default is`rename = FALSE`

, to warrant full backwards compatibility. See #327.Added helper function

`setattrib()`

, to set a new attribute list for an object by reference + invisible return. This is different from the existing function`setAttrib()`

(note the capital A), which takes a shallow copy of list-like objects and returns the result.

`flm`

and`fFtest`

are now internal generic with an added formula method e.g.`flm(mpg ~ hp + carb, mtcars, weights = wt)`

or`fFtest(mpg ~ hp + carb | vs + am, mtcars, weights = wt)`

in addition to the programming interface. Thanks to Grant McDermott for suggesting.Added method

`as.data.frame.qsu`

, to efficiently turn the default array outputs from`qsu()`

into tidy data frames.Major improvements to

`setv`

and`copyv`

, generalizing the scope of operations that can be performed to all common cases. This means that even simple base R operations such as`X[v] <- R`

can now be done significantly faster using`setv(X, v, R)`

.`n`

and`qtab`

can now be added to`options("collapse_mask")`

e.g.`options(collapse_mask = c("manip", "helper", "n", "qtab"))`

. This will export a function`n()`

to get the (group) count in`fsummarise`

and`fmutate`

(which can also always be done using`GRPN()`

but`n()`

is more familiar to*dplyr*users), and will mask`table()`

with`qtab()`

, which is principally a fast drop-in replacement, but with some different further arguments.Added C-level helper function

`all_funs`

, which fetches all the functions called in an expression, similar to`setdiff(all.names(x), all.vars(x))`

but better because it takes account of the syntax. For example let`x = quote(sum(sum))`

i.e. we are summing a column named`sum`

. Then`all.names(x) = c("sum", "sum")`

and`all.vars(x) = "sum"`

so that the difference is`character(0)`

, whereas`all_funs(x)`

returns`"sum"`

. This function makes*collapse*smarter when parsing expressions in`fsummarise`

and`fmutate`

and deciding which ones to vectorize.

Fixed a bug in

`fscale.pdata.frame`

where the default C++ method was being called instead of the list method (i.e. the method didn’t work at all).Fixed 2 minor rchk issues (the remaining ones are spurious).

`fsum`

has an additional argument`fill = TRUE`

(default`FALSE`

) that initializes the result vector with`0`

instead of`NA`

when`na.rm = TRUE`

, so that`fsum(NA, fill = TRUE)`

gives`0`

like`base::sum(NA, na.rm = TRUE)`

.Slight performance increase in

`fmean`

with groups if`na.rm = TRUE`

(the default).Significant performance improvement when using base R expressions involving multiple functions and one column e.g.

`mid_col = (min(col) + max(col)) / 2`

or`lorentz_col = cumsum(sort(col)) / sum(col)`

etc. inside`fsummarise`

and`fmutate`

. Instead of evaluating such expressions on a data subset of one column for each group, they are now turned into a function e.g.`function(x) cumsum(sort(x)) / sum(x)`

which is applied to a single vector split by groups.`fsummarise`

now also adds groupings to transformation functions and operators, which allows full vectorization of more complex tasks involving transformations which are subsequently aggregated. A prime example is grouped bivariate linear model fitting, which can now be done using`mtcars |> fgroup_by(cyl) |> fsummarise(slope = fsum(W(mpg), hp) / fsum(W(mpg)^2))`

. Before 1.8.7 it was necessary to do a mutate step first e.g.`mtcars |> fgroup_by(cyl) |> fmutate(dm_mpg = W(mpg)) |> fsummarise(slope = fsum(dm_mpg, hp) / fsum(dm_mpg^2))`

, because`fsummarise`

did not add groupings to transformation functions like`fwithin/W`

. Thanks to Brodie Gaslam for making me aware of this.Argument

`return.groups`

from`GRP.default`

is now also available in`fgroup_by`

, allowing grouped data frames without materializing the unique grouping columns. This allows more efficient mutate-only operations e.g.`mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmutate(across(hp:carb, fscale))`

. Similarly for aggregation with dropping of grouping columns`mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmean()`

is equivalent and faster than`mtcars |> fgroup_by(cyl) |> fmean(keep.group_vars = FALSE)`

.

- Fixed further minor issues:
- some inline functions in TRA.c needed to be declared ‘static’ to be local in scope (#275)
- timeid.Rd now uses
*zoo*package conditionally and limits size of printout

- Fixed some issues flagged by CRAN:
- Installation on some linux distributions failed because omp.h was included after Rinternals.h
- Some signed integer overflows while running tests caused UBSAN warnings. (This happened inside a hash function where overflows are not a problem. I changed to unsigned int to avoid the UBSAN warning.)
- Ensured that package passes R CMD Check without suggested packages

- Makevars text substitution hack to have CRAN accept a package that combines C, C++ and OpenMP. Thanks also to @MichaelChirico for pointing me in the right direction.

Significant speed improvement in

`qF/qG`

(factor-generation) for character vectors with more than 100,000 obs and many levels if`sort = TRUE`

(the default). For details see the`method`

argument of`?qF`

.Optimizations in

`fmode`

and`fndistinct`

for singleton groups.

Fixed some rchk issues found by Thomas Kalibera from CRAN.

faster

`funique.default`

method.`group`

now also internally optimizes on ‘qG’ objects.

Added function

`fnunique`

(yet another alternative to`data.table::uniqueN`

,`kit::uniqLen`

or`dplyr::n_distinct`

, and principally a simple wrapper for`attr(group(x), "N.groups")`

). At present`fnunique`

generally outperforms the others on data frames.`finteraction`

has an additional argument`factor = TRUE`

. Setting`factor = FALSE`

returns a ‘qG’ object, which is more efficient if just an integer id but no factor object itself is required.Operators (see

`.OPERATOR_FUN`

) have been improved a bit such that id-variables selected in the`.data.frame`

(`by`

,`w`

or`t`

arguments) or`.pdata.frame`

methods (variables in the index) are not computed upon even if they are numeric (since the default is`cols = is.numeric`

). In general, if`cols`

is a function used to select columns of a certain data type, id variables are excluded from computation even if they are of that data type. It is still possible to compute on id variables by explicitly selecting them using names or indices passed to`cols`

, or including them in the lhs of a formula passed to`by`

.Further efforts to facilitate adding the group-count in

`fsummarise`

and`fmutate`

:- if
`options(collapse_mask = "all")`

before loading the package, an additional function`n()`

is exported that works just like`dplyr:::n()`

. - otherwise the same can now always be done using
`GRPN()`

. The previous uses of`GRPN`

are unaltered i.e.`GRPN`

can also:- fetch group sizes directly grouping object or grouped data frame
i.e.
`data |> gby(id) |> GRPN()`

or`data %>% gby(id) %>% ftransform(N = GRPN(.))`

(note the dot). - compute group sizes on the fly, for example
`fsubset(data, GRPN(id) > 10L)`

or`fsubset(data, GRPN(list(id1, id2)) > 10L)`

or`GRPN(data, by = ~ id1 + id2)`

.

- fetch group sizes directly grouping object or grouped data frame
i.e.

- if

*collapse* 1.8.0, released mid of May 2022, brings enhanced
support for indexed computations on time series and panel data by
introducing flexible ‘indexed_frame’ and ‘indexed_series’ classes and
surrounding infrastructure, sets a modest start to OpenMP multithreading
as well as data transformation by reference in statistical functions,
and enhances the packages descriptive statistics toolset.

Functions

`Recode`

,`replace_non_finite`

, depreciated since*collapse*v1.1.0 and`is.regular`

, depreciated since*collapse*v1.5.1 and clashing with a more important function in the*zoo*package, are now removed.*Fast Statistical Functions*operating on numeric data (such as`fmean`

,`fmedian`

,`fsum`

,`fmin`

,`fmax`

, …) now preserve attributes in more cases. Previously these functions did not preserve attributes for simple computations using the default method, and only preserved attributes in grouped computations if`!is.object(x)`

(see NEWS section for collapse 1.4.0). This meant that`fmin`

and`fmax`

did not preserve the attributes of Date or POSIXct objects, and none of these functions preserved ‘units’ objects (used a lot by the*sf*package). Now, attributes are preserved if`!inherits(x, "ts")`

, that is the new default of these functions is to generally keep attributes, except for ‘ts’ objects where doing so obviously causes an unwanted error (note that ‘xts’ and others are handled by the matrix or data.frame method where other principles apply, see NEWS for 1.4.0). An exception are the functions`fnobs`

and`fndistinct`

where the previous default is kept.*Time Series Functions*`flag`

,`fdiff`

,`fgrowth`

and`psacf/pspacf/psccf`

(and the operators`L/F/D/Dlog/G`

) now internally process time objects passed to the`t`

argument (where`is.object(t) && is.numeric(unclass(t))`

) via a new function called`timeid`

which turns them into integer vectors based on the greatest common divisor (GCD) (see below). Previously such objects were converted to factor. This can change behavior of code e.g. a ‘Date’ variable representing monthly data may be regular when converted to factor, but is now irregular and regarded as daily data (with a GCD of 1) because of the different day counts of the months. Users should fix such code by either by calling`qG`

on the time variable (for grouping / factor-conversion) or using appropriate classes e.g.`zoo::yearmon`

. Note that plain numeric vectors where`!is.object(t)`

are still used directly for indexation without passing them through`timeid`

(which can still be applied manually if desired).`BY`

now has an argument`reorder = TRUE`

, which casts elements in the original order if`NROW(result) == NROW(x)`

(like`fmutate`

). Previously the result was just in order of the groups, regardless of the length of the output. To obtain the former outcome users need to set`reorder = FALSE`

.`options("collapse_DT_alloccol")`

was removed, the default is now fixed at 100. The reason is that*data.table*automatically expands the range of overallocated columns if required (so the option is not really necessary), and calling R options from C slows down C code and can cause problems in parallel code.

Fixed a bug in

`fcumsum`

that caused a segfault during grouped operations on larger data, due to flawed internal memory allocation. Thanks @Gulde91 for reporting #237.Fixed a bug in

`across`

caused by two`function(x)`

statements being passed in a list e.g.`mtcars |> fsummarise(acr(mpg, list(ssdd = function(x) sd(x), mu = function(x) mean(x))))`

. Thanks @trang1618 for reporting #233.Fixed an issue in

`across()`

when logical vectors were used to select column on grouped data e.g.`mtcars %>% gby(vs, am) %>% smr(acr(startsWith(names(.), "c"), fmean))`

now works without error.`qsu`

gives proper output for length 1 vectors e.g.`qsu(1)`

.*collapse*depends on R > 3.3.0, due to the use of newer C-level macros introduced then. The earlier indication of R > 2.1.0 was only based on R-level code and misleading. Thanks @ben-schwen for reporting #236. I will try to maintain this dependency for as long as possible, without being too restrained by development in R’s C API and the ALTREP system in particular, which*collapse*might utilize in the future.

Introduction of ‘indexed_frame’,‘indexed_series’ and ‘index_df’ classes: fast and flexible indexed time series and panel data classes that inherit from

*plm*’s ‘pdata.frame’, ‘pseries’ and ‘pindex’ classes. These classes take full advantage of*collapse*’s computational infrastructure, are class-agnostic i.e. they can be superimposed upon any data frame or vector/matrix like object while maintaining most of the functionality of that object, support both time series and panel data, natively handle irregularity, and supports ad-hoc computations inside arbitrary data masking functions and model formulas. This infrastructure comprises of additional functions and methods, and modification of some existing functions and ‘pdata.frame’ / ‘pseries’ methods.New functions:

`findex_by/iby`

,`findex/ix`

,`unindex`

,`reindex`

,`is_irregular`

,`to_plm`

.New methods:

`[.indexed_series`

,`[.indexed_frame`

,`[<-.indexed_frame`

,`$.indexed_frame`

,`$<-.indexed_frame`

,`[[.indexed_frame`

,`[[<-.indexed_frame`

,`[.index_df`

,`fsubset.pseries`

,`fsubset.pdata.frame`

,`funique.pseries`

,`funique.pdata.frame`

,`roworder(v)`

(internal)`na_omit`

(internal),`print.indexed_series`

,`print.indexed_frame`

,`print.index_df`

,`Math.indexed_series`

,`Ops.indexed_series`

.Modification of ‘pseries’ and ‘pdata.frame’ methods for functions

`flag/L/F`

,`fdiff/D/Dlog`

,`fgrowth/G`

,`fcumsum`

,`psmat`

,`psacf/pspacf/psccf`

,`fscale/STD`

,`fbetween/B`

,`fwithin/W`

,`fhdbetween/HDB`

,`fhdwithin/HDW`

,`qsu`

and`varying`

to take advantage of ‘indexed_frame’ and ‘indexed_series’ while continuing to work as before with ‘pdata.frame’ and ‘pseries’.

For more information and details see

`help("indexing")`

.Added function

`timeid`

: Generation of an integer-id/time-factor from time or date sequences represented by integer of double vectors (such as ‘Date’, ‘POSIXct’, ‘ts’, ‘yearmon’, ‘yearquarter’ or plain integers / doubles) by a numerically quite robust greatest common divisor method (see below). This function is used internally in`findex_by`

,`reindex`

and also in evaluation of the`t`

argument to functions like`flag`

/`fdiff`

/`fgrowth`

whenever`is.object(t) && is.numeric(unclass(t))`

(see also note above).Programming helper function

`vgcd`

to efficiently compute the greatest common divisor from a vector or positive integer or double values (which should ideally be unique and sorted as well,`timeid`

uses`vgcd(sort(unique(diff(sort(unique(na_rm(x)))))))`

). Precision for doubles is up to 6 digits.Programming helper function

`frange`

: A significantly faster alternative to`base::range`

, which calls both`min`

and`max`

. Note that`frange`

inherits*collapse*’s global`na.rm = TRUE`

default.Added function

`qtab/qtable`

: A versatile and computationally more efficient alternative to`base::table`

. Notably, it also supports tabulations with frequency weights, and computation of a statistic over combinations of variables. Objects are of class ‘qtab’ that inherits from ‘table’. Thus all ‘table’ methods apply to it.`TRA`

was rewritten in C, and now has an additional argument`set = TRUE`

which toggles data transformation by reference. The function`setTRA`

was added as a shortcut which additionally returns the result invisibly. Since`TRA`

is usually accessed internally through the like-named argument to*Fast Statistical Functions*, passing`set = TRUE`

to those functions yields an internal call to`setTRA`

. For example`fmedian(num_vars(iris), g = iris$Species, TRA = "-", set = TRUE)`

subtracts the species-wise median from the numeric variables in the iris dataset, modifying the data in place and returning the result invisibly. Similarly the argument can be added in other workflows such as`iris |> fgroup_by(Species) |> fmutate(across(1:2, fmedian, set = TRUE))`

or`mtcars |> ftransform(mpg = mpg %+=% hp, wt = fsd(wt, cyl, TRA = "replace_fill", set = TRUE))`

. Note that such chains must be ended by`invisible()`

if no printout is wanted.Exported helper function

`greorder`

, the companion to`gsplit`

to reorder output in`fmutate`

(and now also in`BY`

): let`g`

be a ‘GRP’ object (or something coercible such as a vector) and`x`

a vector, then`greorder`

orders data in`y = unlist(gsplit(x, g))`

such that`identical(greorder(y, g), x)`

.

`fmean`

,`fprod`

,`fmode`

and`fndistinct`

were rewritten in C, providing performance improvements, particularly in`fmode`

and`fndistinct`

, and improvements for integers in`fmean`

and`fprod`

.OpenMP multithreading in

`fsum`

,`fmean`

,`fmedian`

,`fnth`

,`fmode`

and`fndistinct`

, implemented via an additional`nthreads`

argument. The default is to use 1 thread, which internally calls a serial version of the code in`fsum`

and`fmean`

(thus no change in the default behavior). The plan is to slowly roll this out over all statistical functions and then introduce options to set alternative global defaults. Multi-threading internally works different for different functions, see the`nthreads`

argument documentation of a particular function. Unfortunately I currently cannot guarantee thread safety, as parallelization of complex loops entails some tricky bugs and I have limited time to sort these out. So please report bugs, and if you happen to have experience with OpenMP please consider examining the code and making some suggestions.`TRA`

has an additional option`"replace_NA"`

, e.g.`wlddev |> fgroup_by(iso3c) |> fmutate(across(PCGDP:POP, fmedian, TRA = "replace_NA"))`

performs median value imputation of missing values. Similarly for a matrix`X <- matrix(na_insert(rnorm(1e7)), ncol = 100)`

,`fmedian(X, TRA = "replace_NA", set = TRUE)`

(column-wise median imputation by reference).All

*Fast Statistical Functions*support zero group sizes (e.g. grouping with a factor that has unused levels will always produce an output of length`nlevels(x)`

with`0`

or`NA`

elements for the unused levels). Previously this produced an error message with counting/ordinal functions`fmode`

,`fndistinct`

,`fnth`

and`fmedian`

.‘GRP’ objects now also contain a ‘group.starts’ item in the 8’th slot that gives the first positions of the unique groups, and is returned alongside the groups whenever

`return.groups = TRUE`

. This now benefits`ffirst`

when invoked with`na.rm = FALSE`

, e.g.`wlddev %>% fgroup_by(country) %>% ffirst(na.rm = FALSE)`

is now just as efficient as`funique(wlddev, cols = "country")`

. Note that no additional computing cost is incurred by preserving the ‘group.starts’ information.Conversion methods

`GRP.factor`

,`GRP.qG`

,`GRP.pseries`

,`GRP.pdata.frame`

and`GRP.grouped_df`

now also efficiently check if grouping vectors are sorted (the information is stored in the “ordered” element of ‘GRP’ objects). This leads to performance improvements in`gsplit`

/`greorder`

and dependent functions such as`BY`

and`rsplit`

if factors are sorted.`descr()`

received some performance improvements (up to 2x for categorical data), and has an additional argument`sort.table`

, allowing frequency tables for categorical variables to be sorted by frequency (`"freq"`

) or by table values (`"value"`

). The new default is (`"freq"`

), which presents tables in decreasing order of frequency. A method`[.descr`

was added allowing ‘descr’ objects to be subset like a list. The print method was also enhanced, and by default now prints 14 values with the highest frequency and groups the remaining values into a single`... %s Others`

category. Furthermore, if there are any missing values in the column, the percentage of values missing is now printed behind`Statistics`

. Additional arguments`reverse`

and`stepwise`

allow printing in reverse order and/or one variable at a time.`whichv`

(and operators`%==%`

,`%!=%`

) now also support comparisons of equal-length arguments e.g.`1:3 %==% 1:3`

. Note that this should not be used to compare 2 factors.Added some code to the

`.onLoad`

function that checks for the existence of a`.fastverse`

configuration file containing a setting for`_opt_collapse_mask`

: If found the code makes sure that the option takes effect before the package is loaded. This means that inside projects using the*fastverse*and`options("collapse_mask")`

to replace base R /*dplyr*functions,*collapse*cannot be loaded without the masking being applied, making it more secure to utilize this feature. For more information about function masking see`help("collapse-options")`

and for`.fastverse`

configuration files see the fastverse vignette.Added hidden

`.list`

methods for`fhdwithin/HDW`

and`fhdbetween/HDB`

. As for the other`.FAST_FUN`

this is just a wrapper for the data frame method and meant to be used on unclassed data frames.`ss()`

supports unnamed lists / data frames.The

`t`

and`w`

arguments in ‘grouped_df’ methods (NSE) and where formula input is allowed, supports ad-hoc transformations. E.g.`wlddev %>% gby(iso3c) %>% flag(t = qG(date))`

or`L(wlddev, 1, ~ iso3c, ~qG(date))`

, similarly`qsu(wlddev, w = ~ log(POP))`

,`wlddev %>% gby(iso3c) %>% collapg(w = log(POP))`

or`wlddev %>% gby(iso3c) %>% nv() %>% fmean(w = log(POP))`

.Small improvements to

`group()`

algorithm, avoiding some cases where the hash function performed badly, particularly with integers.Function

`GRPnames`

now has a`sep`

argument to choose a separator other than`"."`

.

Corrected a C-level bug in

`gsplit`

that could lead R to crash in some instances (`gsplit`

is used internally in`fsummarise`

,`fmutate`

,`BY`

and`collap`

to perform computations with base R (non-optimized) functions).Ensured that

`BY.grouped_df`

always (by default) returns grouping columns in aggregations i.e.`iris |> gby(Species) |> nv() |> BY(sum)`

now gives the same as`iris |> gby(Species) |> nv() |> fsum()`

.A

`.`

was added to the first argument of functions`fselect`

,`fsubset`

,`colorder`

and`fgroup_by`

, i.e.`fselect(x, ...) -> fselect(.x, ...)`

. The reason for this is that over time I added the option to select-rename columns e.g.`fselect(mtcars, cylinders = cyl)`

, which was not offered when these functions were created. This presents problems if columns should be renamed into`x`

, e.g.`fselect(mtcars, x = cyl)`

failed, see #221. Renaming the first argument to`.x`

somewhat guards against such situations. I think this change is worthwhile to implement, because it makes the package more robust going forward, and usually the first argument of these functions is never invoked explicitly. I really hope this breaks nobody’s code.Added a function

`GRPN`

to make it easy to add a column of group sizes e.g.`mtcars %>% fgroup_by(cyl,vs,am) %>% ftransform(Sizes = GRPN(.))`

or`mtcars %>% ftransform(Sizes = GRPN(list(cyl, vs, am)))`

or`GRPN(mtcars, by = ~cyl+vs+am)`

.Added

`[.pwcor`

and`[.pwcov`

, to be able to subset correlation/covariance matrices without loosing the print formatting.

Also ensuring tidyverse examples are in

`\donttest{}`

and building without the*dplyr*testing file to avoid issues with static code analysis on CRAN.20-50% Speed improvement in

`gsplit`

(and therefore in`fsummarise`

,`fmutate`

,`collap`

and`BY`

*when invoked with base R functions*) when grouping with`GRP(..., sort = TRUE, return.order = TRUE)`

. To enable this by default, the default for argument`return.order`

in`GRP`

was set to`sort`

, which retains the ordering vector (needed for the optimization). Retaining the ordering vector uses up some memory which can possibly adversely affect computations with big data, but with big data`sort = FALSE`

usually gives faster results anyway, and you can also always set`return.order = FALSE`

(also in`fgroup_by`

,`collap`

), so this default gives the best of both worlds.

- An ancient depreciated argument
`sort.row`

(replaced by`sort`

in 2020) is now removed from`collap`

. Also arguments`return.order`

and`method`

were added to`collap`

providing full control of the grouping that happens internally.

Tests needed to be adjusted for the upcoming release of

*dplyr*1.0.8 which involves an API change in`mutate`

.`fmutate`

will not take over these changes i.e.`fmutate(..., .keep = "none")`

will continue to work like`dplyr::transmute`

. Furthermore, no more tests involving*dplyr*are run on CRAN, and I will also not follow along with any future*dplyr*API changes.The C-API macro

`installTrChar`

(used in the new`massign`

function) was replaced with`installChar`

to maintain backwards compatibility with R versions prior to 3.6.0. Thanks @tedmoorman #213.Minor improvements to

`group()`

, providing increased performance for doubles and also increased performance when the second grouping variable is integer, which turned out to be very slow in some instances.

Removed tests involving the

*weights*package (which is not available on R-devel CRAN checks).`fgroup_by`

is more flexible, supporting computing columns e.g.`fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10)`

and various programming options e.g.`fgroup_by(GGDC10S, 1:3)`

,`fgroup_by(GGDC10S, c("Variable", "Country"))`

, or`fgroup_by(GGDC10S, is.character)`

. You can also use column sequences e.g.`fgroup_by(GGDC10S, Country:Variable, Year)`

, but this should not be mixed with computing columns. Compute expressions may also not include the`:`

function.More memory efficient attribute handling in C/C++ (using C-API macro

`SHALLOW_DUPLICATE_ATTRIB`

instead of`DUPLICATE_ATTRIB`

) in most places.

Ensured that the base pipe

`|>`

is not used in tests or examples, to avoid errors on CRAN checks with older versions of R.Also adjusted

`psacf`

/`pspacf`

/`psccf`

to take advantage of the faster grouping by`group`

.

Fixed minor C/C++ issues flagged in CRAN checks.

Added option

`ties = "last"`

to`fmode`

.Added argument

`stable.algo`

to`qsu`

. Setting`stable.algo = FALSE`

toggles a faster calculation of the standard deviation, yielding 2x speedup on large datasets.*Fast Statistical Functions*now internally use`group`

for grouping data if both`g`

and`TRA`

arguments are used, yielding efficiency gains on unsorted data.Ensured that

`fmutate`

and`fsummarise`

can be called if*collapse*is not attached.

*collapse* 1.7.0, released mid January 2022, brings major
improvements in the computational backend of the package, its data
manipulation capabilities, and a whole set of new functions that enable
more flexible and memory efficient R programming - significantly
enhancing the language itself. For the vast majority of codes, updating
to 1.7 should not cause any problems.

`num_vars`

is now implemented in C, yielding a massive performance increase over checking columns using`vapply(x, is.numeric, logical(1))`

. It selects columns where`(is.double(x) || is.integer(x)) && !is.object(x)`

. This provides the same results for most common classes found in data frames (e.g. factors and date columns are not numeric), however it is possible for users to define methods for`is.numeric`

for other objects, which will not be respected by`num_vars`

anymore. A prominent example are base R’s ‘ts’ objects i.e.`is.numeric(AirPassengers)`

returns`TRUE`

, but`is.object(AirPassengers)`

is also`TRUE`

so the above yields`FALSE`

, implying - if you happened to work with data frames of ‘ts’ columns - that`num_vars`

will now not select those anymore. Please make me aware if there are other important classes that are found in data frames and where`is.numeric`

returns`TRUE`

.`num_vars`

is also used internally in`collap`

so this might affect your aggregations.In

`flag`

,`fdiff`

and`fgrowth`

, if a plain numeric vector is passed to the`t`

argument such that`is.double(t) && !is.object(t)`

, it is coerced to integer using`as.integer(t)`

and directly used as time variable, rather than applying ordered grouping first. This is to avoid the inefficiency of grouping, and owes to the fact that in most data imported into R with various packages, the time (year) variables are coded as double although they should be integer (I also don’t know of any cases where time needs to be indexed by a non-date variable with decimal places). Note that the algorithm internally handles irregularity in the time variable so this is not a problem. Should this break any code, kindly raise an issue on GitHub.The function

`setrename`

now truly renames objects by reference (without creating a shallow copy). The same is true for`vlabels<-`

(which was rewritten in C) and a new function`setrelabel`

. Thus additional care needs to be taken (with use inside functions etc.) as the renaming will take global effects unless a shallow copy of the data was created by some prior operation inside the function. If in doubt, better use`frename`

which creates a shallow copy.Some improvements to the

`BY`

function, both in terms of performance and security. Performance is enhanced through a new C function`gsplit`

, providing split-apply-combine computing speeds competitive with*dplyr*on a much broader range of R objects. Regarding Security: if the result of the computation has the same length as the original data, names / rownames and grouping columns (for grouped data) are only added to the result object if known to be valid, i.e. if the data was originally sorted by the grouping columns (information recorded by`GRP.default(..., sort = TRUE)`

, which is called internally on non-factor/GRP/qG objects). This is because`BY`

does not reorder data after the split-apply-combine step (unlike`dplyr::mutate`

); data are simply recombined in the order of the groups. Because of this, in general,`BY`

should be used to compute summary statistics (unless data are sorted before grouping). The added security makes this explicit.Added a method

`length.GRP`

giving the length of a grouping object. This could break code calling`length`

on a grouping object before (which just returned the length of the list).Functions renamed in collapse 1.6.0 will now print a message telling you to use the updated names. The functions under the old names will stay around for 1-3 more years.

- The passing of argument
`order`

instead of`sort`

in function`GRP`

(from a very early version of collapse), is now disabled.

- Fixed a bug in some functions using Welfords Online Algorithm
(
`fvar`

,`fsd`

,`fscale`

and`qsu`

) to calculate variances, occurring when initial or final zero weights caused the running sum of weights in the algorithm to be zero, yielding a division by zero and`NA`

as output although a value was expected. These functions now skip zero weights alongside missing weights, which also implies that you can pass a logical vector to the weights argument to very efficiently calculate statistics on a subset of data (e.g. using`qsu`

).

Function

`group`

was added, providing a low-level interface to a new unordered grouping algorithm based on hashing in C and optimized for R’s data structures. The algorithm was heavily inspired by the great`kit`

package of Morgan Jacob, and now feeds into the package through multiple central functions (including`GRP`

/`fgroup_by`

,`funique`

and`qF`

) when invoked with argument`sort = FALSE`

. It is also used in internal groupings performed in data transformation functions such as`fwithin`

(when no factor or ‘GRP’ object is provided to the`g`

argument). The speed of the algorithm is very promising (often superior to`radixorder`

), and it could be used in more places still. I welcome any feedback on its performance on different datasets.Function

`gsplit`

provides an efficient alternative to`split`

based on grouping objects. It is used as a new backend to`rsplit`

(which also supports data frame) as well as`BY`

,`collap`

,`fsummarise`

and`fmutate`

- for more efficient grouped operations with functions external to the package.Added multiple functions to facilitate memory efficient programming (written in C). These include elementary mathematical operations by reference (

`setop`

,`%+=%`

,`%-=%`

,`%*=%`

,`%/=%`

), supporting computations involving integers and doubles on vectors, matrices and data frames (including row-wise operations via`setop`

) with no copies at all. Furthermore a set of functions which check a single value against a vector without generating logical vectors:`whichv`

,`whichNA`

(operators`%==%`

and`%!=%`

which return indices and are significantly faster than`==`

, especially inside functions like`fsubset`

),`anyv`

and`allv`

(`allNA`

was already added before). Finally, functions`setv`

and`copyv`

speed up operations involving the replacement of a value (`x[x == 5] <- 6`

) or of a sequence of values from a equally sized object (`x[x == 5] <- y[x == 5]`

, or`x[ind] <- y[ind]`

where`ind`

could be pre-computed vectors or indices) in vectors and data frames without generating any logical vectors or materializing vector subsets.Function

`vlengths`

was added as a more efficient alternative to`lengths`

(without method dispatch, simply coded in C).Function

`massign`

provides a multivariate version of`assign`

(written in C, and supporting all basic vector types). In addition the operator`%=%`

was added as an efficient multiple assignment operator. (It is called`%=%`

and not`%<-%`

to facilitate the translation of Matlab or Python codes into R, and because the zeallot package already provides multiple-assignment operators (`%<-%`

and`%->%`

), which are significantly more versatile, but orders of magnitude slower than`%=%`

)

Fully fledged

`fmutate`

function that provides functionality analogous to`dplyr::mutate`

(sequential evaluation of arguments, including arbitrary tagged expressions and`across`

statements).`fmutate`

is optimized to work together with the packages*Fast Statistical and Data Transformation Functions*, yielding fast, vectorized execution, but also benefits from`gsplit`

for other operations.`across()`

function implemented for use inside`fsummarise`

and`fmutate`

. It is also optimized for*Fast Statistical and Data Transformation Functions*, but performs well with other functions too. It has an additional arguments`.apply = FALSE`

which will apply functions to the entire subset of the data instead of individual columns, and thus allows for nesting tibbles and estimating models or correlation matrices by groups etc..`across()`

also supports an arbitrary number of additional arguments which are split and evaluated by groups if necessary. Multiple`across()`

statements can be combined with tagged vector expressions in a single call to`fsummarise`

or`fmutate`

. Thus the computational framework is pretty general and similar to*data.table*, although less efficient with big datasets.Added functions

`relabel`

and`setrelabel`

to make interactive dealing with variable labels a bit easier. Note that both functions operate by reference. (Through`vlabels<-`

which is implemented in C. Taking a shallow copy of the data frame is useless in this case because variable labels are attributes of the columns, not of the frame). The only difference between the two is that`setrelabel`

returns the result invisibly.function shortcuts

`rnm`

and`mtt`

added for`frename`

and`fmutate`

.`across`

can also be abbreviated using`acr`

.Added two options that can be invoked before loading of the package to change the namespace:

`options(collapse_mask = c(...))`

can be set to export copies of selected (or all) functions in the package that start with`f`

removing the leading`f`

e.g.`fsubset`

->`subset`

(both`fsubset`

and`subset`

will be exported). This allows masking base R and dplyr functions (even basic functions such as`sum`

,`mean`

,`unique`

etc. if desired) with*collapse*’s fast functions, facilitating the optimization of existing codes and allowing you to work with*collapse*using a more natural namespace. The package has been internally insulated against such changes, but of course they might have major effects on existing codes. Also`options(collapse_F_to_FALSE = FALSE)`

can be invoked to get rid of the lead operator`F`

, which masks`base::F`

(an issue raised by some people who like to use`T`

/`F`

instead of`TRUE`

/`FALSE`

). Read the help page`?collapse-options`

for more information.

Package loads faster (because I don’t fetch functions from some other C/C++ heavy packages in

`.onLoad`

anymore, which implied unnecessary loading of a lot of DLLs).`fsummarise`

is now also fully featured supporting evaluation of arbitrary expressions and`across()`

statements. Note that mixing*Fast Statistical Functions*with other functions in a single expression can yield unintended outcomes, read more at`?fsummarise`

.`funique`

benefits from`group`

in the default`sort = FALSE`

, configuration, providing extra speed and unique values in first-appearance order in both the default and the data frame method, for all data types.Function

`ss`

supports both empty`i`

or`j`

.The printout of

`fgroup_by`

also shows minimum and maximum group size for unbalanced groupings.In

`ftransformv/settransformv`

and`fcomputev`

, the`vars`

argument is also evaluated inside the data frame environment, allowing NSE specifications using column names e.g.`ftransformv(data, c(col1, col2:coln), FUN)`

.`qF`

with option`sort = FALSE`

now generates factors with levels in first-appearance order (instead of a random order assigned by the hash function), and can also be called on an existing factor to recast the levels in first-appearance order. It is also faster with`sort = FALSE`

(thanks to`group`

).`finteraction`

has argument`sort = FALSE`

to also take advantage of`group`

.`rsplit`

has improved performance through`gsplit`

, and an additional argument`use.names`

, which can be used to return an unnamed list.Speedup in

`vtypes`

and functions`num_vars`

,`cat_vars`

,`char_vars`

,`logi_vars`

and`fact_vars`

. Note than`num_vars`

behaves slightly differently as discussed above.`vlabels(<-)`

/`setLabels`

rewritten in C, giving a ~20x speed improvement. Note that they now operate by reference.`vlabels`

,`vclasses`

and`vtypes`

have a`use.names`

argument. The default is`TRUE`

(as before).`colorder`

can rename columns on the fly and also has a new mode`pos = "after"`

to place all selected columns after the first selected one, e.g.:`colorder(mtcars, cyl, vs_new = vs, am, pos = "after")`

. The`pos = "after"`

option was also added to`roworderv`

.`add_stub`

and`rm_stub`

have an additional`cols`

argument to apply a stub to certain columns only e.g.`add_stub(mtcars, "new_", cols = 6:9)`

.`namlab`

has additional arguments`N`

and`Ndistinct`

, allowing to display number of observations and distinct values next to variable names, labels and classes, to get a nice and quick overview of the variables in a large dataset.`copyMostAttrib`

only copies the`"row.names"`

attribute when known to be valid.`na_rm`

can now be used to efficiently remove empty or`NULL`

elements from a list.`flag`

,`fdiff`

and`fgrowth`

produce less messages (i.e. no message if you don’t use a time variable in grouped operations, and messages about computations on highly irregular panel data only if data length exceeds 10 million obs.).The print methods of

`pwcor`

and`pwcov`

now have a`return`

argument, allowing users to obtain the formatted correlation matrix, for exporting purposes.`replace_NA`

,`recode_num`

and`recode_char`

have improved performance and an additional argument`set`

to take advantage of`setv`

to change (some) data by reference. For`replace_NA`

, this feature is mature and setting`set = TRUE`

will modify all selected columns in place and return the data invisibly. For`recode_num`

and`recode_char`

only a part of the transformations are done by reference, thus users will still have to assign the data to preserve changes. In the future, this will be improved so that`set = TRUE`

toggles all transformations to be done by reference.

Use of

`VECTOR_PTR`

in C API now gives an error on R-devel even if`USE_RINTERNALS`

is defined. Thus this patch gets rid of all remaining usage of this macro to avoid errors on CRAN checks using the development version of R.The print method for

`qsu`

now uses an apostrophe (’) to designate million digits, instead of a comma (,). This is to avoid confusion with the decimal point, and the typical use of (,) for thousands (which I don’t like).

Checks on the gcc11 compiler flagged an additional issue with a pointer pointing to element -1 of a C array (which I had done on purpose to index it with an R integer vector).

CRAN checks flagged a valgrind issue because of comparing an uninitialized value to something.

CRAN maintainers have asked me to remove a line in a Makevars file intended to reduce the size of Rcpp object files (which has been there since version 1.4). So the installed size of the package may now be larger.

A patch for 1.6.0 which fixes issues flagged by CRAN and adds a few handy extras.

Puts examples using the new base pipe

`|>`

inside`\donttest{}`

so that they don’t fail CRAN tests on older R versions.Fixes a LTO issue caused by a small mistake in a header file (which does not have any implications to the user but was detected by CRAN checks).

Added a function

`fcomputev`

, which allows selecting columns and transforming them with a function in one go. The`keep`

argument can be used to add columns to the selection that are not transformed.Added a function

`setLabels`

as a wrapper around`vlabels<-`

to facilitate setting variable labels inside pipes.Function

`rm_stub`

now has an argument`regex = TRUE`

which triggers a call to`gsub`

and allows general removing of character sequences in column names on the fly.

`vlabels<-`

and`setLabels`

now support list of variable labels or other attributes (i.e. the`value`

is internally subset using`[[`

, not`[`

). Thus they are now general functions to attach a vector or list of attributes to columns in a list / data frame.

*collapse* 1.6.0, released end of June 2021, presents some
significant improvements in the user-friendliness, compatibility and
programmability of the package, as well as a few function additions.

`ffirst`

,`flast`

,`fnobs`

,`fsum`

,`fmin`

and`fmax`

were rewritten in C. The former three now also support list columns (where`NULL`

or empty list elements are considered missing values when`na.rm = TRUE`

), and are extremely fast for grouped aggregation if`na.rm = FALSE`

. The latter three also support and return integers, with significant performance gains, even compared to base R. Code using these functions expecting an error for list-columns or expecting double output even if the input is integer should be adjusted.*collapse*now directly supports*sf*data frames through functions like`fselect`

,`fsubset`

,`num_vars`

,`qsu`

,`descr`

,`varying`

,`funique`

,`roworder`

,`rsplit`

,`fcompute`

etc., which will take along the geometry column even if it is not explicitly selected (mirroring*dplyr*methods for*sf*data frames). This is mostly done internally at C-level, so functions remain simple and fast. Existing code that explicitly selects the geometry column is unaffected by the change, but code of the form`sf_data %>% num_vars %>% qDF %>% ...`

, where columns excluding geometry were selected and the object later converted to a data frame, needs to be rewritten as`sf_data %>% qDF %>% num_vars %>% ...`

. A short vignette was added describing the integration of*collapse*and*sf*.I’ve received several requests for increased namespace consistency.

*collapse*functions were named to be consistent with base R,*dplyr*and*data.table*, resulting in names like`is.Date`

,`fgroup_by`

or`settransformv`

. To me this makes sense, but I’ve been convinced that a bit more consistency is advantageous. Towards that end I have decided to eliminate the ‘.’ notation of base R and to remove some unexpected capitalizations in function names giving some people the impression I was using camel-case. The following functions are renamed:`fNobs`

->`fnobs`

,`fNdistinct`

->`fndistinct`

,`pwNobs`

->`pwnobs`

,`fHDwithin`

->`fhdwithin`

,`fHDbetween`

->`fhdbetween`

,`as.factor_GRP`

->`as_factor_GRP`

,`as.factor_qG`

->`as_factor_qG`

,`is.GRP`

->`is_GRP`

,`is.qG`

->`is_qG`

,`is.unlistable`

->`is_unlistable`

,`is.categorical`

->`is_categorical`

,`is.Date`

->`is_date`

,`as.numeric_factor`

->`as_numeric_factor`

,`as.character_factor`

->`as_character_factor`

,`Date_vars`

->`date_vars`

. This is done in a very careful manner, the others will stick around for a long while (end of 2022), and the generics of`fNobs`

,`fNdistinct`

,`fHDbetween`

and`fHDwithin`

will be kept in the package for an indeterminate period, but their core methods will not be exported beyond 2022. I will start warning about these renamed functions in 2022. In the future I will undogmatically stick to a function naming style with lowercase function names and underslashes where words need to be split. Other function names will be kept. To say something about this: The quick-conversion functions`qDF`

`qDT`

,`qM`

,`qF`

,`qG`

are consistent and in-line with*data.table*(`setDT`

etc.), and similarly the operators`L`

,`F`

,`D`

,`Dlog`

,`G`

,`B`

,`W`

,`HDB`

,`HDW`

. I’ll keep`GRP`

,`BY`

and`TRA`

, for lack of better names, parsimony and because they are central to the package. The camel case will be kept in helper functions`setDimnames`

etc. because they work like*stats*`setNames`

and do not modify the argument by reference (like`settransform`

or`setrename`

and various*data.table*functions). Functions`copyAttrib`

and`copyMostAttrib`

are exports of like-named functions in the C API and thus kept as they are. Finally, I want to keep`fFtest`

the way it is because the F-distribution is widely recognized by a capital F.I’ve updated the

`wlddev`

dataset with the latest data from the World Bank, and also added a variable giving the total population (which may be useful e.g. for population-weighted aggregations across regions). The extra column could invalidate codes used to demonstrate something (I had to adjust some examples, tests and code in vignettes).

Added a function

`fcumsum`

(written in C), permitting flexible (grouped, ordered) cumulative summations on matrix-like objects (integer or double typed) with extra methods for grouped data frames and panel series and data frames. Apart from the internal grouping, and an ordering argument allowing cumulative sums in a different order than data appear,`fcumsum`

has 2 options to deal with missing values. The default (`na.rm = TRUE`

) is to skip (preserve) missing values, whereas setting`fill = TRUE`

allows missing values to be populated with the previous value of the cumulative sum (starting from 0).Added a function

`alloc`

to efficiently generate vectors initialized with any value (faster than`rep_len`

).Added a function

`pad`

to efficiently pad vectors / matrices / data.frames with a value (default is`NA`

). This function was mainly created to make it easy to expand results coming from a statistical model fitted on data with missing values to the original length. For example let`data <- na_insert(mtcars); mod <- lm(mpg ~ cyl, data)`

, then we can do`settransform(data, resid = pad(resid(mod), mod$na.action))`

, or we could do`pad(model.matrix(mod), mod$na.action)`

or`pad(model.frame(mod), mod$na.action)`

to receive matrices and data frames from model data matching the rows of`data`

.`pad`

is a general function that will also work with mixed-type data. It is also possible to pass a vector of indices matching the rows of the data to`pad`

, in which case`pad`

will fill gaps in those indices with a value/row in the data.

Full

*data.table*support, including reference semantics (`set*`

,`:=`

)!! There is some complex C-level programming behind*data.table*’s operations by reference. Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the*data.table*, and an`".internal.selfref"`

attribute containing an external pointer is used to check if any shallow copy was made using base R commands like`<-`

. This is done to avoid even a shallow copy of the*data.table*in manipulations using`:=`

(and is in my opinion not worth it as even large tables are shallow copied by base R (>=3.1.0) within microseconds and all of this complicates development immensely). Previously,*collapse*treated*data.table*’s like any other data frame, using shallow copies in manipulations and preserving the attributes (thus ignoring how*data.table*works internally). This produced a warning whenever you wanted to use*data.table*reference semantics (`set*`

,`:=`

) after passing the*data.table*through a*collapse*function such as`collap`

,`fselect`

,`fsubset`

,`fgroup_by`

etc. From v1.6.0, I have adopted essential C code from*data.table*to do the overallocation and generate the`".internal.selfref"`

attribute, thus seamless workflows combining*collapse*and*data.table*are now possible. This comes at a cost of about 2-3 microseconds per function, as to do this I have to shallow copy the*data.table*again and add extra column pointers and an`".internal.selfref"`

attribute telling*data.table*that this table was not copied (it seems to be the only way to do it for now). This integration encompasses all data manipulation functions in*collapse*, but not the*Fast Statistical Functions*themselves. Thus you can do`agDT <- DT %>% fselect(id, col1:coln) %>% collap(~id, fsum); agDT[, newcol := 1]`

, but you would need to do add a`qDT`

after a function like`fsum`

if you want to use reference semantics without incurring a warning:`agDT <- DT %>% fselect(id, col1:coln) %>% fgroup_by(id) %>% fsum %>% qDT; agDT[, newcol := 1]`

.*collapse*appears to be the first package that attempts to account for*data.table*’s internal working without importing*data.table*, and`qDT`

is now the fastest way to create a fully functional*data.table*from any R object. A global option`"collapse_DT_alloccol"`

was added to regulate how many columns*collapse*overallocates when creating*data.table*’s. The default is 100, which is lower than the*data.table*default of 1024. This was done to increase efficiency of the additional shallow copies, and may be changed by the user.Programming enabled with

`fselect`

and`fgroup_by`

(you can now pass vectors containing column names or indices). Note that instead of`fselect`

you should use`get_vars`

for standard eval programming.`fselect`

and`fsubset`

support in-place renaming, e.g.`fselect(data, newname = var1, var3:varN)`

,`fsubset(data, vark > varp, newname = var1, var3:varN)`

.`collap`

supports renaming columns in the custom argument, e.g.`collap(data, ~ id, custom = list(fmean = c(newname = "var1", "var2"), fmode = c(newname = 3), flast = is_date))`

.Performance improvements:

`fsubset`

/`ss`

return the data or perform a simple column subset without deep copying the data if all rows are selected through a logical expression.`fselect`

and`get_vars`

,`num_vars`

etc. are slightly faster through data frame subsetting done fully in C.`ftransform`

/`fcompute`

use`alloc`

instead of`base::rep`

to replicate a scalar value which is slightly more efficient.`fcompute`

now has a`keep`

argument, to preserve several existing columns when computing columns on a data frame.`replace_NA`

now has a`cols`

argument, so we can do`replace_NA(data, cols = is.numeric)`

, to replace`NA`

’s in numeric columns. I note that for big numeric data`data.table::setnafill`

is the most efficient solution.`fhdbetween`

and`fhdwithin`

have an`effect`

argument in*plm*methods, allowing centering on selected identifiers. The default is still to center on all panel identifiers.

The plot method for panel series matrices and arrays

`plot.psmat`

was improved slightly. It now supports custom colours and drawing of a grid.`settransform`

and`settransformv`

can now be called without attaching the package e.g.`collapse::settransform(data, ...)`

. These errored before when*collapse*is not loaded because they are simply wrappers around`data <- ftransform(data, ...)`

. I’d like to note from a discussion that avoiding shallow copies with`<-`

(e.g. via`:=`

) does not appear to yield noticeable performance gains. Where*data.table*is faster on big data this mostly has to do with parallelism and sometimes with algorithms, generally not memory efficiency.Functions

`setAttrib`

,`copyAttrib`

and`copyMostAttrib`

only make a shallow copy of lists, not of atomic vectors (which amounts to doing a full copy and is inefficient). Thus atomic objects are now modified in-place.Small improvements: Calling

`qF(x, ordered = FALSE)`

on an ordered factor will remove the ordered class, the operators`L`

,`F`

,`D`

,`Dlog`

,`G`

,`B`

,`W`

,`HDB`

,`HDW`

and functions like`pwcor`

now work on unnamed matrices or data frames.

- A test that occasionally fails on Mac is removed, and all unit
testing is now removed from CRAN.
*collapse*has close to 10,000 unit tests covering all central pieces of code. Half of these tests depend on generated data, and for some reasons there is always a test or two that occasionally fail on some operating system (usually not Windows), requiring me to submit a patch. This is not constructive to either the development or the use of this package, therefore tests are now removed from CRAN. They are still run on codecov.io, and every new release is thoroughly tested on Windows.

The first argument of

`ftransform`

was renamed to`.data`

from`X`

. This was done to enable the user to transform columns named “X”. For the same reason the first argument of`frename`

was renamed to`.x`

from`x`

(not`.data`

to make it explicit that`.x`

can be any R object with a “names” attribute). It is not possible to depreciate`X`

and`x`

without at the same time undoing the benefits of the argument renaming, thus this change is immediate and code breaking in rare cases where the first argument is explicitly set.The function

`is.regular`

to check whether an R object is atomic or list-like is depreciated and will be removed before the end of the year. This was done to avoid a namespace clash with the*zoo*package (#127).

`unlist2d`

produced a subsetting error if an empty list was present in the list-tree. This is now fixed, empty or`NULL`

elements in the list-tree are simply ignored (#99).

A function

`fsummarize`

was added to facilitate translating*dplyr*/*data.table*code to*collapse*. Like`collap`

, it is only very fast when used with the*Fast Statistical Functions*.A function

`t_list`

is made available to efficiently transpose lists of lists.

- C files are compiled -O3 on Windows, which gives a boost of around 20% for the grouping mechanism applied to character data.

A small patch for 1.5.0 that:

Fixes a numeric precision issue when grouping doubles (e.g. before

`qF(wlddev$LIFEEX)`

gave an error, now it works).Fixes a minor issue with

`fhdwithin`

when applied to*pseries*and`fill = FALSE`

.

*collapse* 1.5.0, released early January 2021, presents
important refinements and some additional functionality.

- I apologize for inconveniences caused by the temporal archival of
*collapse*from December 19, 2020. This archival was caused by the archival of the important*lfe*package on the 4th of December.*collapse*depended on*lfe*for higher-dimensional centering, providing the`fhdbetween / fhdwithin`

functions for generalized linear projecting / partialling out. To remedy the damage caused by the removal of*lfe*, I had to rewrite`fhdbetween / fhdwithin`

to take advantage of the demeaning algorithm provided by*fixest*, which has some quite different mechanics. Beforehand, I made some significant changes to`fixest::demean`

itself to make this integration happen. The CRAN deadline was the 18th of December, and I realized too late that I would not make this. A request to CRAN for extension was declined, so*collapse*got archived on the 19th. I have learned from this experience, and*collapse*is now sufficiently insulated that it will not be taken off CRAN even if all suggested packages were removed from CRAN.

- Segfaults in several
*Fast Statistical Functions*when passed`numeric(0)`

are fixed (thanks to @eshom and @acylam, #101). The default behavior is that all*collapse*functions return`numeric(0)`

again, except for`fnobs`

,`fndistinct`

which return`0L`

, and`fvar`

,`fsd`

which return`NA_real_`

.

Functions

`fhdwithin / HDW`

and`fhdbetween / HDB`

have been reworked, delivering higher performance and greater functionality: For higher-dimensional centering and heterogeneous slopes, the`demean`

function from the*fixest*package is imported (conditional on the availability of that package). The linear prediction and partialling out functionality is now built around`flm`

and also allows for weights and different fitting methods.In

`collap`

, the default behavior of`give.names = "auto"`

was altered when used together with the`custom`

argument. Before the function name was always added to the column names. Now it is only added if a column is aggregated with two different functions. I apologize if this breaks any code dependent on the new names, but this behavior just better reflects most common use (applying only one function per column), as well as STATA’s collapse.For list processing functions like

`get_elem`

,`has_elem`

etc. the default for the argument`DF.as.list`

was changed from`TRUE`

to`FALSE`

. This means if a nested lists contains data frame’s, these data frame’s will not be searched for matching elements. This default also reflects the more common usage of these functions (extracting entire data frame’s or computed quantities from nested lists rather than searching / subsetting lists of data frame’s). The change also delivers a considerable performance gain.

- Vignettes were outsourced to the website. This nearly halves the size of the source package, and should induce users to appreciate the built-in documentation. The website also makes for much more convenient reading and navigation of these book-style vignettes.

Added a set of 10 operators

`%rr%`

,`%r+%`

,`%r-%`

,`%r*%`

,`%r/%`

,`%cr%`

,`%c+%`

,`%c-%`

,`%c*%`

,`%c/%`

to facilitate and speed up row- and column-wise arithmetic operations involving a vector and a matrix / data frame / list. For example`X %r*% v`

efficiently multiplies every row of`X`

with`v`

. Note that more advanced functionality is already provided in`TRA()`

,`dapply()`

and the*Fast Statistical Functions*, but these operators are intuitive and very convenient to use in matrix or matrix-style code, or in piped expressions.Added function

`missing_cases`

(opposite of`complete.cases`

and faster for data frame’s / lists).Added function

`allNA`

for atomic vectors.New vignette about using

*collapse*together with*data.table*, available online.

- Time series functions and operators
`flag / L / F`

,`fdiff / D / Dlog`

and`fgrowth / G`

now natively support irregular time series and panels, and feature a ‘complete approach’ i.e. values are shifted around taking full account of the underlying time-dimension!

Functions

`pwcor`

and`pwcov`

can now compute weighted correlations on the pairwise or complete observations, supported by C-code that is (conditionally) imported from the*weights*package.`fFtest`

now also supports weights.`collap`

now provides an easy workaround to aggregate some columns using weights and others without. The user may simply append the names of*Fast Statistical Functions*with`_uw`

to disable weights. Example:`collapse::collap(mtcars, ~ cyl, custom = list(fmean_uw = 3:4, fmean = 8:10), w = ~ wt)`

aggregates columns 3 through 4 using a simple mean and columns 8 through 10 using the weighted mean.The parallelism in

`collap`

using`parallel::mclapply`

has been reworked to operate at the column-level, and not at the function level as before. It is still not available for Windows though. The default number of cores was set to`mc.cores = 2L`

, which now gives an error on windows if`parallel = TRUE`

.function

`recode_char`

now has additional options`ignore.case`

and`fixed`

(passed to`grepl`

), for enhanced recoding character data based on regular expressions.`rapply2d`

now has`classes`

argument permitting more flexible use.`na_rm`

and some other internal functions were rewritten in C.`na_rm`

is now 2x faster than`x[!is.na(x)]`

with missing values and 10x faster without missing values.

An improvement to the

`[.GRP_df`

method enabling the use of most*data.table*methods (such as`:=`

) on a grouped*data.table*created with`fgroup_by`

.Some documentation updates by Kevin Tappe.

collapse 1.4.1 is a small patch for 1.4.0 that:

fixes clang-UBSAN and rchk issues in 1.4.0 (minor bugs in compiled code resulting, in this case, from trying to coerce a

`NaN`

value to integer, and failing to protect a shallow copy of a variable).Adds a method

`[.GRP_df`

that allows robust subsetting of grouped objects created with`fgroup_by`

(thanks to Patrice Kiener for flagging this).

*collapse* 1.4.0, released early November 2020, presents some
important refinements, particularly in the domain of attribute handling,
as well as some additional functionality. The changes make
*collapse* smarter, more broadly compatible and more secure, and
should not break existing code.

*Deep Matrix Dispatch / Extended Time Series Support:*The default methods of all statistical and transformation functions dispatch to the matrix method if`is.matrix(x) && !inherits(x, "matrix")`

evaluates to`TRUE`

. This specification avoids invoking the default method on classed matrix-based objects (such as multivariate time series of the*xts*/*zoo*class) not inheriting a ‘matrix’ class, while still allowing the user to manually call the default method on matrices (objects with implicit or explicit ‘matrix’ class). The change implies that*collapse*’s generic statistical functions are now well suited to transform*xts*/*zoo*and many other time series and matrix-based classes.*Fully Non-Destructive Piped Workflow:*`fgroup_by(x, ...)`

now only adds a class*grouped_df*, not classes*table_df*,*tbl*,*grouped_df*, and preserves all classes of`x`

. This implies that workflows such as`x %>% fgroup_by(...) %>% fmean`

etc. yields an object`xAG`

of the same class and attributes as`x`

, not a tibble as before.*collapse*aims to be as broadly compatible, class-agnostic and attribute preserving as possible.

*Thorough and Controlled Object Conversions:*Quick conversion functions`qDF`

,`qDT`

and`qM`

now have additional arguments`keep.attr`

and`class`

providing precise user control over object conversions in terms of classes and other attributes assigned / maintained. The default (`keep.attr = FALSE`

) yields*hard*conversions removing all but essential attributes from the object. E.g. before`qM(EuStockMarkets)`

would just have returned`EuStockMarkets`

(because`is.matrix(EuStockMarkets)`

is`TRUE`

) whereas now the time series class and ‘tsp’ attribute are removed.`qM(EuStockMarkets, keep.attr = TRUE)`

returns`EuStockMarkets`

as before.

*Smarter Attribute Handling:*Drawing on the guidance given in the R Internals manual, the following standards for optimal non-destructive attribute handling are formalized and communicated to the user:The default and matrix methods of the

*Fast Statistical Functions*preserve attributes of the input in grouped aggregations (‘names’, ‘dim’ and ‘dimnames’ are suitably modified). If inputs are classed objects (e.g. factors, time series, checked by`is.object`

), the class and other attributes are dropped. Simple (non-grouped) aggregations of vectors and matrices do not preserve attributes, unless`drop = FALSE`

in the matrix method. An exemption is made in the default methods of functions`ffirst`

,`flast`

and`fmode`

, which always preserve the attributes (as the input could well be a factor or date variable).The data frame methods are unaltered: All attributes of the data frame and columns in the data frame are preserved unless the computation result from each column is a scalar (not computing by groups) and

`drop = TRUE`

(the default).Transformations with functions like

`flag`

,`fwithin`

,`fscale`

etc. are also unaltered: All attributes of the input are preserved in the output (regardless of whether the input is a vector, matrix, data.frame or related classed object). The same holds for transformation options modifying the input (“-”, “-+”, “/”, “+”, “*”, “%%”, “-%%”) when using`TRA()`

function or the`TRA = "..."`

argument to the*Fast Statistical Functions*.For

`TRA`

‘replace’ and ‘replace_fill’ options, the data type of the STATS is preserved, not of x. This provides better results particularly with functions like`fnobs`

and`fndistinct`

. E.g. previously`fnobs(letters, TRA = "replace")`

would have returned the observation counts coerced to character, because`letters`

is character. Now the result is integer typed. For attribute handling this means that the attributes of x are preserved unless x is a classed object and the data types of x and STATS do not match. An exemption to this rule is made if x is a factor and an integer (non-factor) replacement is offered to STATS. In that case the attributes of x are copied exempting the ‘class’ and ‘levels’ attribute, e.g. so that`fnobs(iris$Species, TRA = "replace")`

gives an integer vector, not a (malformed) factor. In the unlikely event that STATS is a classed object, the attributes of STATS are preserved and the attributes of x discarded.

*Reduced Dependency Burden:*The dependency on the*lfe*package was made optional. Functions`fhdwithin`

/`fhdbetween`

can only perform higher-dimensional centering if*lfe*is available. Linear prediction and centering with a single factor (among a list of covariates) is still possible without installing*lfe*. This change means that*collapse*now only depends on base R and*Rcpp*and is supported down to R version 2.10.

Added function

`rsplit`

for efficient (recursive) splitting of vectors and data frames.Added function

`fdroplevels`

for very fast missing level removal + added argument`drop`

to`qF`

and`GRP.factor`

, the default is`drop = FALSE`

. The addition of`fdroplevels`

also enhances the speed of the`fFtest`

function.`fgrowth`

supports annualizing / compounding growth rates through added`power`

argument.A function

`flm`

was added for bare bones (weighted) linear regression fitting using different efficient methods: 4 from base R (`.lm.fit`

,`solve`

,`qr`

,`chol`

), using`fastLm`

from*RcppArmadillo*(if installed), or`fastLm`

from*RcppEigen*(if installed).Added function

`qTBL`

to quickly convert R objects to tibble.helpers

`setAttrib`

,`copyAttrib`

and`copyMostAttrib`

exported for fast attribute handling in R (similar to`attributes<-()`

, these functions return a shallow copy of the first argument with the set of attributes replaced, but do not perform checks for attribute validity like`attributes<-()`

. This can yield large performance gains with big objects).helper

`cinv`

added wrapping the expression`chol2inv(chol(x))`

(efficient inverse of a symmetric, positive definite matrix via Choleski factorization).A shortcut

`gby`

is now available to abbreviate the frequently used`fgroup_by`

function.A print method for grouped data frames of any class was added.

- Faster internal methods for factors for
`funique`

,`fmode`

and`fndistinct`

.

The

*grouped_df*methods for`flag`

,`fdiff`

,`fgrowth`

now also support multiple time variables to identify a panel e.g.`data %>% fgroup_by(region, person_id) %>% flag(1:2, list(month, day))`

.More security features for

`fsubset.data.frame`

/`ss`

,`ss`

is now internal generic and also supports subsetting matrices.In some functions (like

`na_omit`

), passing double values (e.g.`1`

instead of integer`1L`

) or negative indices to the`cols`

argument produced an error or unexpected behavior. This is now fixed in all functions.Fixed a bug in helper function

`all_obj_equal`

occurring if objects are not all equal.Some performance improvements through increased use of pointers and C API functions.

collapse 1.3.2, released mid September 2020:

Fixed a small bug in

`fndistinct`

for grouped distinct value counts on logical vectors.Additional security for

`ftransform`

, which now efficiently checks the names of the data and replacement arguments for uniqueness, and also allows computing and transforming list-columns.Added function

`ftransformv`

to facilitate transforming selected columns with function - a very efficient replacement for`dplyr::mutate_if`

and`dplyr::mutate_at`

.`frename`

now allows additional arguments to be passed to a renaming function.

collapse 1.3.1, released end of August 2020, is a patch for v1.3.0 that takes care of some unit test failures on certain operating systems (mostly because of numeric precision issues). It provides no changes to the code or functionality.

collapse 1.3.0, released mid August 2020:

`dapply`

and`BY`

now drop all unnecessary attributes if`return = "matrix"`

or`return = "data.frame"`

are explicitly requested (the default`return = "same"`

still seeks to preserve the input data structure).`unlist2d`

now saves integer rownames if`row.names = TRUE`

and a list of matrices without rownames is passed, and`id.factor = TRUE`

generates a normal factor not an ordered factor. It is however possible to write`id.factor = "ordered"`

to get an ordered factor id.`fdiff`

argument`logdiff`

renamed to`log`

, and taking logs is now done in R (reduces size of C++ code and does not generate as many NaN’s).`logdiff`

may still be used, but it may be deactivated in the future. Also in the matrix and data.frame methods for`flag`

,`fdiff`

and`fgrowth`

, columns are only stub-renamed if more than one lag/difference/growth rate is computed.

Added

`fnth`

for fast (grouped, weighted) n’th element/quantile computations.Added

`roworder(v)`

and`colorder(v)`

for fast row and column reordering.Added

`frename`

and`setrename`

for fast and flexible renaming (by reference).Added function

`fungroup`

, as replacement for`dplyr::ungroup`

, intended for use with`fgroup_by`

.`fmedian`

now supports weights, computing a decently fast (grouped) weighted median based on radix ordering.`fmode`

now has the option to compute min and max mode, the default is still simply the first mode.`fwithin`

now supports quasi-demeaning (added argument`theta`

) and can thus be used to manually estimate random-effects models.`funique`

is now generic with a default vector and data.frame method, providing fast unique values and rows of data. The default was changed to`sort = FALSE`

.The shortcut

`gvr`

was created for`get_vars(..., regex = TRUE)`

.A helper

`.c`

was introduced for non-standard concatenation (i.e.`.c(a, b) == c("a", "b")`

).

`fmode`

and`fndistinct`

have become a bit faster.`fgroup_by`

now preserves*data.table*’s.`ftransform`

now also supports a data.frame as replacement argument, which automatically replaces matching columns and adds unmatched ones. Also`ftransform<-`

was created as a more formal replacement method for this feature.`collap`

columns selected through`cols`

argument are returned in the order selected if`keep.col.order = FALSE`

. Argument`sort.row`

is depreciated, and replace by argument`sort`

. In addition the`decreasing`

and`na.last`

arguments were added and handed down to`GRP.default`

.`radixorder`

‘sorted’ attribute is now always attached.`stats::D`

which is masked when collapse is attached, is now preserved through methods`D.expression`

and`D.call`

.`GRP`

option`call = FALSE`

to omit a call to`match.call`

-> minor performance improvement.Several small performance improvements through rewriting some internal helper functions in C and reworking some R code.

Performance improvements for some helper functions,

`setRownames`

/`setColnames`

,`na_insert`

etc.Increased scope of testing statistical functions. The functionality of the package is now secured by 7700 unit tests covering all central bits and pieces.

collapse 1.2.1, released end of May 2020:

Minor fixes for 1.2.0 issues that prevented correct installation on Mac OS X and a vignette rebuilding error on solaris.

`fmode.grouped_df`

with groups and weights now saves the sum of the weights instead of the max (this makes more sense as the max only applies if all elements are unique).

collapse 1.2.0, released mid May 2020:

*grouped_df*methods for fast statistical functions now always attach the grouping variables to the output in aggregations, unless argument`keep.group_vars = FALSE`

. (formerly grouping variables were only attached if also present in the data. Code hinged on this feature should be adjusted)`qF`

`ordered`

argument default was changed to`ordered = FALSE`

, and the`NA`

level is only added if`na.exclude = FALSE`

. Thus`qF`

now behaves exactly like`as.factor`

.`Recode`

is depreciated in favor of`recode_num`

and`recode_char`

, it will be removed soon. Similarly`replace_non_finite`

was renamed to`replace_Inf`

.In

`mrtl`

and`mctl`

the argument`ret`

was renamed`return`

and now takes descriptive character arguments (the previous version was a direct C++ export and unsafe, code written with these functions should be adjusted).`GRP`

argument`order`

is depreciated in favor of argument`decreasing`

.`order`

can still be used but will be removed at some point.

- Fixed a bug in
`flag`

where unused factor levels caused a group size error.

Added a suite of functions for fast data manipulation:

`fselect`

selects variables from a data frame and is equivalent but much faster than`dplyr::select`

.`fsubset`

is a much faster version of`base::subset`

to subset vectors, matrices and data.frames. The function`ss`

was also added as a faster alternative to`[.data.frame`

.`ftransform`

is a much faster update of`base::transform`

, to transform data frames by adding, modifying or deleting columns. The function`settransform`

does all of that by reference.`fcompute`

is equivalent to`ftransform`

but returns a new data frame containing only the columns computed from an existing one.`na_omit`

is a much faster and enhanced version of`base::na.omit`

.`replace_NA`

efficiently replaces missing values in multi-type data.

Added function

`fgroup_by`

as a much faster version of`dplyr::group_by`

based on*collapse*grouping. It attaches a ‘GRP’ object to a data frame, but only works with*collapse*’s fast functions. This allows*dplyr*like manipulations that are fully*collapse*based and thus significantly faster, i.e.`data %>% fgroup_by(g1,g2) %>% fselect(cola,colb) %>% fmean`

. Note that`data %>% dplyr::group_by(g1,g2) %>% dplyr::select(cola,colb) %>% fmean`

still works, in which case the*dplyr*‘group’ object is converted to ‘GRP’ as before. However`data %>% fgroup_by(g1,g2) %>% dplyr::summarize(...)`

does not work.Added function

`varying`

to efficiently check the variation of multi-type data over a dimension or within groups.Added function

`radixorder`

, same as`base::order(..., method = "radix")`

but more accessible and with built-in grouping features.Added functions

`seqid`

and`groupid`

for generalized run-length type id variable generation from grouping and time variables.`seqid`

in particular strongly facilitates lagging / differencing irregularly spaced panels using`flag`

,`fdiff`

etc.`fdiff`

now supports quasi-differences i.e. \(x_t - \rho x_{t-1}\) and quasi-log differences i.e. \(log(x_t) - \rho log(x_{t-1})\). an arbitrary \(\rho\) can be supplied.Added a

`Dlog`

operator for faster access to log-differences.

Faster grouping with

`GRP`

and faster factor generation with added radix method + automatic dispatch between hash and radix method.`qF`

is now ~ 5x faster than`as.factor`

on character and around 30x faster on numeric data. Also`qG`

was enhanced.Further slight speed tweaks here and there.

`collap`

now provides more control for weighted aggregations with additional arguments`w`

,`keep.w`

and`wFUN`

to aggregate the weights as well. The defaults are`keep.w = TRUE`

and`wFUN = fsum`

. A specialty of`collap`

remains that`keep.by`

and`keep.w`

also work for external objects passed, so code of the form`collap(data, by, FUN, catFUN, w = data$weights)`

will now have an aggregated`weights`

vector in the first column.

`qsu`

now also allows weights to be passed in formula i.e.`qsu(data, by = ~ group, pid = ~ panelid, w = ~ weights)`

.`fgrowth`

has a`scale`

argument, the default is`scale = 100`

which provides growth rates in percentage terms (as before), but this may now be changed.All statistical and transformation functions now have a hidden list method, so they can be applied to unclassed list-objects as well. An error is however provided in grouped operations with unequal-length columns.

collapse 1.1.0 released early April 2020:

Fixed remaining gcc10, LTO and valgrind issues in C/C++ code, and added some more tests (there are now ~ 5300 tests ensuring that

*collapse*statistical functions perform as expected).Fixed the issue that supplying an unnamed list to

`GRP()`

, i.e.`GRP(list(v1, v2))`

would give an error. Unnamed lists are now automatically named ‘Group.1’, ‘Group.2’, etc…Fixed an issue where aggregating by a single id in

`collap()`

(i.e.`collap(data, ~ id1)`

), the id would be coded as factor in the aggregated data.frame. All variables including id’s now retain their class and attributes in the aggregated data.Added weights (

`w`

) argument to`fsum`

and`fprod`

.Added an argument

`mean = 0`

to`fwithin / W`

. This allows simple and grouped centering on an arbitrary mean,`0`

being the default. For grouped centering`mean = "overall.mean"`

can be specified, which will center data on the overall mean of the data. The logical argument`add.global.mean = TRUE`

used to toggle this in*collapse*1.0.0 is therefore depreciated.Added arguments

`mean = 0`

(the default) and`sd = 1`

(the default) to`fscale / STD`

. These arguments now allow to (group) scale and center data to an arbitrary mean and standard deviation. Setting`mean = FALSE`

will just scale data while preserving the mean(s). Special options for grouped scaling are`mean = "overall.mean"`

(same as`fwithin / W`

), and`sd = "within.sd"`

, which will scale the data such that the standard deviation of each group is equal to the within- standard deviation (= the standard deviation computed on the group-centered data). Thus group scaling a panel-dataset with`mean = "overall.mean"`

and`sd = "within.sd"`

harmonizes the data across all groups in terms of both mean and variance. The fast algorithm for variance calculation toggled with`stable.algo = FALSE`

was removed from`fscale`

. Welford’s numerically stable algorithm used by default is fast enough for all practical purposes. The fast algorithm is still available for`fvar`

and`fsd`

.Added the modulus (

`%%`

) and subtract modulus (`-%%`

) operations to`TRA()`

.Added the function

`finteraction`

, for fast interactions, and`as_character_factor`

to coerce a factor, or all factors in a list, to character (analogous to`as_numeric_factor`

). Also exported the function`ckmatch`

, for matching with error message showing non-matched elements.

First version of the package featuring only the functions

`collap`

and`qsu`

based on code shared by Sebastian Krantz on R-devel, February 2019.Major rework of the package using Rcpp and data.table internals, introduction of fast statistical functions and operators and expansion of the scope of the package to a broad set of data transformation and exploration tasks. Several iterations of enhancing speed of R code. Seamless integration of

*collapse*with*dplyr*,*plm*and*data.table*. CRAN release of*collapse*1.0.0 on 19th March 2020.