BanyanHDF5.jl

HDF5 is one of the most widely adopted storage formats for arrays. BanyanHDF5.jl lets you (a) save Banyan arrays to HDF5 datasets and (b) load HDF5 datasets into Banyan arrays.

Getting Started

To get started with BanyanHDF5.jl, follow the steps here to set up Banyan.jl.

Then, open the Julia REPL and press ] to enter "Pkg" (package) mode and run add BanyanHDF5. (Ensure you have also added Banyan and BanyanArrays first.)

Finally, exit the package mode and start a session.

using Banyan, BanyanArrays, BanyanHDF5

start_session(
    cluster_name="daily-sensor-reading",
    nworkers=128,
    session_name="Variance-Report",
    email_when_ready=true
)

Awesome! You can now use the functions described below for massively parallel data processing in this session of 128 workers.
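As a quick sanity check that the session is working, a simple parallel computation might look like the following. This is a hedged sketch: it assumes BanyanArrays mirrors the Base array API (`fill`, `map`, `sum`) and that `compute` fetches a result back to the client, as described in the BanyanArrays documentation.

```julia
using Banyan, BanyanArrays

# Create an array distributed across the session's 128 workers.
x = BanyanArrays.fill(1.0, 2048)

# Elementwise transform and reduction, executed in parallel.
y = map(v -> v * 2, x)
total = sum(y)

# Bring the scalar result back to the client (assumes `compute`
# is the API for collecting results).
println(compute(total))
```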

Reading/Writing HDF5 Datasets

Arrays can be read from and written to HDF5 datasets with read_hdf5 and write_hdf5. Simply pass in the string location to read from or write to. If reading from the Internet, you should prefix the location with http:// or https://. If reading from or writing to Amazon S3, you should prefix the location with s3://. For example:

temps = read_hdf5("https://www.example.com/readings87.hdf5/dset4")

read_hdf5("s3://myawesomebucket/result.hdf5/readings")

write_hdf5(temps, "s3://myawesomebucket/result.hdf5/readings")
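Putting these together, a read-transform-write round trip might look like the sketch below. The bucket name, dataset paths, and the unit conversion are illustrative assumptions; it also assumes Banyan arrays support `map` as in BanyanArrays.

```julia
using Banyan, BanyanArrays, BanyanHDF5

# Read a dataset; the path inside the HDF5 file ("/readings")
# follows the filename in the location string.
temps = read_hdf5("s3://myawesomebucket/result.hdf5/readings")

# Banyan arrays mirror the Base array API, so an elementwise
# transform like this Celsius-to-Fahrenheit conversion runs in
# parallel across the session's workers.
temps_f = map(t -> t * 9 / 5 + 32, temps)

# Write the transformed dataset back to a new dataset in S3.
write_hdf5(temps_f, "s3://myawesomebucket/result.hdf5/readings_f")
```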

Please see Using Amazon S3 for instructions on setting up Amazon S3. To read from or write to HDF5 datasets in S3, your cluster must have been created with access to the S3 bucket containing the HDF5 file you are working with.

If you want to read from or write to an HDF5 dataset in an S3 bucket that your cluster cannot access, you may need to destroy the cluster and create a new one with access to the desired bucket.

When reading an HDF5 dataset, a sample must be collected. Find out how to collect a sample faster and how to preserve cached samples after writing.

Nota Bene About Compilation

Banyan automatically compiles HDF5.jl with MPI-enabled parallelism if the Project.toml that your project uses contains "HDF5". So if you import HDF5.jl, or even just have a comment containing "HDF5", we will compile parallel HDF5 on your cluster for greater performance.
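Concretely, having HDF5.jl listed in your project's Project.toml is enough to trigger the parallel build. A minimal example (the UUID shown is HDF5.jl's registered UUID; verify it against your own Project.toml, which Pkg generates for you when you run add HDF5):

```toml
# Project.toml — the presence of "HDF5" here causes Banyan to
# compile MPI-enabled parallel HDF5 on the cluster.
[deps]
HDF5 = "f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f"
```

In practice you never edit this by hand: entering package mode with ] and running add HDF5 records the dependency for you.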