HDF5 files as back-end

The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. Its design makes it very suitable for storing large datasets together with their metadata in a way that still allows fast access to all the information. Since digitalDLSorteR needs to simulate large numbers of pseudo-bulk samples to be properly trained, and takes as input single-cell RNA-seq datasets whose size keeps growing, it implements a set of functionalities that make it possible to use HDF5 files as a back-end at every step where large amounts of data are handled.

To work with this format, digitalDLSorteR relies mainly on the HDF5Array and DelayedArray packages, although some functionalities have been implemented directly with rhdf5. For more information about these packages, we recommend their corresponding vignettes and this workshop by Peter Hickey: Effectively using the DelayedArray framework to support the analysis of large datasets.
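As a minimal illustration of the on-disk model these packages provide (a generic sketch, not a digitalDLSorteR function; the file path and dataset name are arbitrary), a matrix can be written to an HDF5 file and then manipulated lazily as a DelayedArray, so that data are only read from disk, block by block, when a result is actually requested:

```r
library(HDF5Array)  # also loads DelayedArray

# Write an ordinary in-memory matrix to an HDF5 file; the returned
# object is an HDF5Matrix backed by the file, not by RAM
m <- matrix(rnorm(1e4), nrow = 100)
hm <- writeHDF5Array(m, filepath = tempfile(fileext = ".h5"), name = "counts")

# Operations are delayed: this line reads nothing from disk yet
logs <- log1p(hm)

# Data are only realized (streamed in blocks) when a result is needed
col.means <- colMeans(logs)
head(col.means)
```

The key point is that `logs` records the pending `log1p()` operation instead of computing it, which is what keeps memory usage low for matrices far larger than the available RAM.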

General usage

In Building new deconvolution models, some examples of its usage are shown, together with the parameters that must be considered.

The simplest way to use this functionality is to set just the file.backend parameter, as in the examples provided in Building new deconvolution models.
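For instance, a sketch of what this looks like when simulating pseudo-bulk samples (assuming a `DigitalDLSorter` object prepared as in that vignette; the `simBulkProfiles` call, its `block.processing` argument, and the file name are taken as illustrative assumptions rather than a verbatim recipe):

```r
library(digitalDLSorteR)

# `DDLS` is assumed to be a DigitalDLSorter object that already
# contains the loaded single-cell profiles and the cell composition
# matrices needed for the simulation
DDLS <- simBulkProfiles(
  DDLS,
  type.data = "both",               # simulate both training and test sets
  file.backend = "pseudo_bulk.h5",  # store counts in this HDF5 file
  block.processing = TRUE           # write samples in blocks to limit RAM usage
)
```

With `file.backend` set, the simulated count matrices live in the HDF5 file on disk rather than in memory, at the cost of the slower access discussed below.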


HDF5 files are a very useful tool for working with large datasets that would otherwise be impossible to handle. However, it is important to keep in mind that runtimes can be longer when they are used, as accessing data from RAM is always faster than from disk. Therefore, we recommend using this functionality only when dealing with very large datasets on limited computational resources. As the HDF5Array and DelayedArray authors point out: "If you can load your data into memory and still compute on it, then you're always going to have a better time doing it that way."