arrow 6.0.1

There are now two ways to query Arrow data:

1. Expanded Arrow-native queries: aggregation and joins

dplyr::summarize(), both grouped and ungrouped, is now implemented for Arrow Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can aggregate over larger-than-memory datasets backed by many files. Supported aggregation functions include n(), n_distinct(), min(), max(), sum(), mean(), var(), sd(), any(), and all(). median() and quantile() with one probability are also supported and currently return approximate results using the t-digest algorithm.

Along with summarize(), you can also call count(), tally(), and distinct(), which effectively wrap summarize().
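As a minimal sketch of grouped aggregation, the example below uses a small in-memory Table built from the built-in mtcars data frame as a stand-in for a larger Dataset. The object names (cars, by_cyl, cyl_counts) are illustrative only.

```r
library(arrow)
library(dplyr)

# A small in-memory Table stands in for a larger, multi-file Dataset.
cars <- Table$create(mtcars)

# Grouped aggregation is evaluated by Arrow; collect() brings the result into R.
by_cyl <- cars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg), max_hp = max(hp)) %>%
  collect() %>%
  arrange(cyl)

# count() wraps summarize(), so it follows the same code path.
cyl_counts <- cars %>%
  count(cyl) %>%
  collect() %>%
  arrange(cyl)
```

The arrange() calls run in R after collect(), because the row order of an aggregated result should not be relied on.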

This enhancement does change the behavior of summarize() and collect() in some cases: see “Breaking changes” below for details.

In addition to summarize(), mutating and filtering equality joins (inner_join(), left_join(), right_join(), full_join(), semi_join(), and anti_join()) are supported natively in Arrow.
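The following sketch illustrates one mutating join and one filtering join on two small Tables. The tables and their columns (orders, customers, id, amount, name) are made up for illustration.

```r
library(arrow)
library(dplyr)

# Two hypothetical Tables that share an integer key column, id.
orders <- Table$create(data.frame(
  id = c(1L, 2L, 3L, 4L),
  amount = c(10, 20, 30, 40)
))
customers <- Table$create(data.frame(
  id = c(1L, 2L, 5L),
  name = c("a", "b", "e"),
  stringsAsFactors = FALSE
))

# Mutating join: keep every order and add the customer name where the key matches.
joined <- orders %>%
  left_join(customers, by = "id") %>%
  collect() %>%
  arrange(id)

# Filtering join: keep only the orders that have no matching customer.
unmatched <- orders %>%
  anti_join(customers, by = "id") %>%
  collect() %>%
  arrange(id)
```

Both key columns have the same type (32-bit integer) here, which keeps the equality comparison straightforward.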

Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.

New non-aggregating compute functions include string functions like str_to_title() and strftime() as well as compute functions for extracting date parts (e.g. year(), month()) from dates. This is not a complete list of additional compute functions; for an exhaustive list of available compute functions see list_compute_functions().
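A short sketch of these functions inside mutate(), assuming an Arrow build with UTF-8 string kernels available. The sample data and column names are invented for the example.

```r
library(arrow)
library(dplyr)

# Hypothetical sample data: one string column and one Date column.
people <- Table$create(data.frame(
  name = c("ada lovelace", "grace hopper"),
  seen = as.Date(c("2020-02-29", "2021-11-05")),
  stringsAsFactors = FALSE
))

# The string and date-part functions are translated to Arrow compute functions.
res <- people %>%
  mutate(
    name_title = str_to_title(name),
    seen_year = year(seen),
    seen_month = month(seen)
  ) %>%
  collect()

# list_compute_functions() returns the names of available compute functions;
# the pattern argument narrows the list with a regular expression.
utf8_functions <- list_compute_functions(pattern = "^utf8_")
```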

We’ve also worked to fill in support for all data types, such as Decimal, for functions added in previous releases. All type limitations mentioned in previous release notes should no longer be valid. If you find a function that is not implemented for a certain data type, please report an issue.

2. DuckDB integration

If you have the duckdb package installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the to_duckdb() function. This allows you to use duckdb’s dbplyr methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before to_duckdb() are evaluated in Arrow, and duckdb can push down some predicates to Arrow as well. This handoff does not copy the data; instead, it uses Arrow’s C interface (just like passing Arrow data between R and Python), so no serialization or data copying costs are incurred.

You can also take a duckdb tbl and call to_arrow() to stream data to Arrow’s query engine. This means that in a single dplyr pipeline, you could start with an Arrow Dataset, evaluate some steps in DuckDB, then evaluate the rest in Arrow.
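A sketch of such a round trip, assuming the duckdb and dbplyr packages are installed (the code is skipped otherwise). The mtcars data and the name result are illustrative only.

```r
library(arrow)
library(dplyr)

# Run only when the optional packages are available.
if (requireNamespace("duckdb", quietly = TRUE) &&
    requireNamespace("dbplyr", quietly = TRUE)) {
  result <- Table$create(mtcars) %>%
    filter(mpg > 20) %>%                               # evaluated in Arrow
    to_duckdb() %>%                                    # hand off to DuckDB without copying
    group_by(cyl) %>%
    summarize(mean_hp = mean(hp, na.rm = TRUE)) %>%    # evaluated in DuckDB
    to_arrow() %>%                                     # stream back to Arrow
    collect() %>%
    arrange(cyl)
}
```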

Breaking changes

Installation on Linux

Other enhancements and fixes

Internals

arrow 5.0.0.2

This patch version contains fixes for some sanitizer and compiler warnings.

arrow 5.0.0

More dplyr

CSV writing

C interface

Other enhancements

arrow 4.0.1

arrow 4.0.0.1

arrow 4.0.0

dplyr methods

Many more dplyr verbs are supported on Arrow objects:

Over 100 functions can now be called on Arrow objects inside a dplyr verb:

Datasets

Other improvements

Installation and configuration

arrow 3.0.0

Python and Flight

Enhancements

Bug fixes

Packaging and installation

arrow 2.0.0

Datasets

AWS S3 support

Flight RPC

Flight is a general-purpose client-server framework for high performance transport of large datasets over network interfaces. The arrow R package now provides methods for connecting to Flight RPC servers to send and receive data. See vignette("flight", package = "arrow") for an overview.

Computation

Packaging and installation

Bug fixes and other enhancements

arrow 1.0.1

Bug fixes

arrow 1.0.0

Arrow format conversion

Datasets

Other enhancements

Bug fixes and deprecations

Installation and packaging

arrow 0.17.1

arrow 0.17.0

Feather v2

This release includes support for version 2 of the Feather file format. Feather v2 features full support for all Arrow data types, fixes the 2GB per-column limitation for large amounts of string data, and allows files to be compressed using either lz4 or zstd. write_feather() can write either version 2 or version 1 Feather files, and read_feather() automatically detects which file version it is reading.

Related to this change, several functions around reading and writing data have been reworked. read_ipc_stream() and write_ipc_stream() have been added to facilitate writing data to the Arrow IPC stream format, which is slightly different from the IPC file format (Feather v2 is the IPC file format).

Behavior has been standardized: all read_<format>() functions return an R data.frame (default) or a Table if the argument as_data_frame = FALSE; all write_<format>() functions return the data object, invisibly. To facilitate some workflows, a special write_to_raw() function is added to wrap write_ipc_stream() and return the raw vector containing the buffer that was written.

To achieve this standardization, read_table(), read_record_batch(), read_arrow(), and write_arrow() have been deprecated.
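The standardized read and write behavior can be sketched as follows. The data frame and temporary file path are placeholders, and write_feather() is left at its default compression.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
path <- tempfile(fileext = ".feather")

# write_<format>() functions return their input invisibly.
returned <- write_feather(df, path)

# read_<format>() functions return a data.frame by default ...
df_back <- read_feather(path)
# ... or an Arrow Table when as_data_frame = FALSE.
tab_back <- read_feather(path, as_data_frame = FALSE)

# write_to_raw() serializes to the IPC stream format and returns a raw vector,
# which read_ipc_stream() can read directly.
raw_buf <- write_to_raw(df)
df_from_raw <- read_ipc_stream(raw_buf)
```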

Python interoperability

The 0.17 Apache Arrow release includes a C data interface that allows exchanging Arrow data in-process at the C level without copying and without libraries having a build or runtime dependency on each other. This enables us to use reticulate to share data between R and Python (pyarrow) efficiently.

See vignette("python", package = "arrow") for details.

Datasets

Installation

Other bug fixes and enhancements

arrow 0.16.0.2

arrow 0.16.0

Multi-file datasets

This release includes a dplyr interface to Arrow Datasets, which let you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with open_dataset() and then use dplyr methods to select(), filter(), etc. Work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. dplyr methods are conditionally loaded if you have dplyr available; it is not a hard dependency.

See vignette("dataset", package = "arrow") for details.
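A minimal sketch of this workflow follows. It first creates a small partitioned directory with write_dataset(), a function from later releases of the package that is used here only to produce sample files; the directory and the choice of the Feather format are assumptions made for the example.

```r
library(arrow)
library(dplyr)

# Create a sample multi-file dataset: one subdirectory per value of cyl.
dir <- tempfile("mtcars_ds_")
write_dataset(mtcars, dir, format = "feather", partitioning = "cyl")

# Open the directory as a single Dataset; no data is read yet.
ds <- open_dataset(dir, format = "feather")

# filter() and select() are evaluated by Arrow; collect() pulls the result into R.
four_cyl <- ds %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  collect()
```

Because cyl is a partition key, the filter can skip files in the other partitions rather than reading them.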

Linux installation

A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.

See vignette("install", package = "arrow") for details.

Data exploration

Compression

Other fixes and improvements

arrow 0.15.1

arrow 0.15.0

Breaking changes

New features

Other upgrades

arrow 0.14.1

Initial CRAN release of the arrow package. Key features include: