Defining Locations

Julia-only

This feature is currently only supported in Julia. Python support is coming soon.

Every future tracked by Banyan has a location. Whenever a partitioning function splits or merges data, it splits from or merges to either the future's own location or one of the Memory, Disk, or None locations.

A location specifies a source and/or destination for data. Partitioning functions are dynamically dispatched based on the location of the data, and the location of a future can be specified with the sourced and destined functions. These functions are useful when annotating functions such as read_hdf5, where the result must be sourced from a location constructed from a path to an HDF5 dataset.

To define a new location, you must provide the following:

A location constructor
A partitioning function (PF) for reading from the location
A PF for writing to the location
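The three pieces above can be sketched as follows. This is only a conceptual model written in Python, not Banyan's API: the names `lines_file`, `read_lines_partition`, and `write_lines_partition`, the `"LinesFile"` location name, and the dictionary layout are all hypothetical, and the location is assumed to be a plain text file with one record per line.

```python
import os
import tempfile


def lines_file(path):
    """Location constructor (hypothetical): describe a text file with one record per line."""
    return {
        "src_name": "LinesFile",
        "src_params": {"path": path},
        "dst_name": "LinesFile",
        "dst_params": {"path": path},
    }


def read_lines_partition(params, part_idx, nparts):
    """PF for reading (hypothetical): return only the lines belonging to partition part_idx."""
    with open(params["path"]) as f:
        lines = f.read().splitlines()
    # Contiguous, near-equal ranges: partition i gets lines[start:stop].
    start = len(lines) * part_idx // nparts
    stop = len(lines) * (part_idx + 1) // nparts
    return lines[start:stop]


def write_lines_partition(params, part, part_idx):
    """PF for writing (hypothetical): store one partition as its own part file."""
    os.makedirs(params["path"], exist_ok=True)
    part_path = os.path.join(params["path"], f"part{part_idx}.txt")
    with open(part_path, "w") as f:
        f.write("\n".join(part))
    return part_path


def demo():
    """Split a 5-line file into 2 partitions and write each partition back out."""
    with tempfile.TemporaryDirectory() as tmp:
        src = lines_file(os.path.join(tmp, "in.txt"))
        dst = lines_file(os.path.join(tmp, "out"))
        with open(src["src_params"]["path"], "w") as f:
            f.write("a\nb\nc\nd\ne")
        parts = [read_lines_partition(src["src_params"], i, 2) for i in range(2)]
        for i, part in enumerate(parts):
            write_lines_partition(dst["dst_params"], part, i)
        written = sorted(os.listdir(dst["dst_params"]["path"]))
    return parts, written
```

In this sketch the constructor only records names and parameters; all I/O happens in the two PFs, which each handle a single partition so that workers could run them independently.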

You can then either use the location yourself or contribute it directly to banyan-julia (or banyan-python):

Fork and clone banyan-julia (or banyan-python)
Create a new package (e.g., BanyanParquet.jl) containing the location constructor, PFs, and reading/writing functions
Create a pull request (PR) on the banyan-julia (or banyan-python) GitHub repository

WIP

Please note that the documentation on Extending Banyan is currently a work in progress.

What are Locations?

Each future not only represents some data that is yet to be computed but also stores information about the location of the data:

The source (if any) that the data should be split from
The destination (if any) that the data should be merged to
A sample of data to split from

In Banyan.jl, you can construct locations with functions like Value, Size, Client, and Remote. These locations can be assigned to futures with sourced and destined. Once locations are assigned to futures, samples are taken for source locations. These samples are used to estimate various properties of the data, which helps to construct accurate partition types.
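To make the relationship between futures, locations, and samples concrete, here is a minimal Python model. It is not Banyan.jl code: the `SampledLocation` and `Future` classes, the `sample_rate` field, and the `estimated_rows` helper are assumptions made for illustration, and `sourced` and `destined` here only mimic the role of the Banyan.jl functions with the same names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampledLocation:
    """Hypothetical stand-in for a location: a name plus an optional sample."""
    name: str
    sample: Optional[list] = None
    sample_rate: int = 1  # assumed: one sampled row per `sample_rate` source rows


@dataclass
class Future:
    """Data that is yet to be computed, plus where it is split from and merged to."""
    source: Optional[SampledLocation] = None
    destination: Optional[SampledLocation] = None


def sourced(fut, loc):
    """Assign the location that the future's data should be split from."""
    fut.source = loc
    return fut


def destined(fut, loc):
    """Assign the location that the future's data should be merged to."""
    fut.destination = loc
    return fut


def estimated_rows(fut):
    """Estimate the total row count by scaling the sample size by the sampling rate."""
    src = fut.source
    if src is None or src.sample is None:
        return None  # nothing to estimate from
    return len(src.sample) * src.sample_rate
```

For example, a future sourced from a location with a 4-row sample taken at a rate of 1 in 100 is estimated to hold 400 rows, which is the kind of property a scheduler could use when choosing how to partition the data.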

Constructors

Locations can be constructed from a Banyan client library (such as Banyan.jl for Banyan Julia). A constructed location contains several important things:

Name of this location as a source (something that can be read from)
Name of this location as a destination (something that can be written to)
Parameters for usage as a source
Parameters for usage as a destination
A sample

The names and parameters are passed to the splitting and merging functions dispatched for the future that has this location: the source name and parameters go to splitting functions, and the destination name and parameters go to merging functions.
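The way names and parameters reach the dispatched functions can be illustrated with a small registry. This is a sketch under stated assumptions rather than how Banyan implements dispatch: the `Location` fields, the `SPLITTERS` and `MERGERS` registries, the `splits` and `merges` decorators, and the `"InMemoryList"` location name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Hypothetical record holding the five parts of a constructed location."""
    src_name: str
    dst_name: str
    src_params: dict = field(default_factory=dict)
    dst_params: dict = field(default_factory=dict)
    sample: object = None


SPLITTERS = {}  # source name -> splitting function
MERGERS = {}    # destination name -> merging function


def splits(name):
    """Register a splitting function for locations with this source name."""
    def register(fn):
        SPLITTERS[name] = fn
        return fn
    return register


def merges(name):
    """Register a merging function for locations with this destination name."""
    def register(fn):
        MERGERS[name] = fn
        return fn
    return register


@splits("InMemoryList")
def split_list(params, part_idx, nparts):
    # Return the contiguous slice of the list that belongs to this partition.
    data = params["value"]
    start = len(data) * part_idx // nparts
    stop = len(data) * (part_idx + 1) // nparts
    return data[start:stop]


@merges("InMemoryList")
def merge_list(params, parts):
    # Concatenate partitions in order into the destination list.
    params["out"].extend(x for part in parts for x in part)


def split(loc, part_idx, nparts):
    """Dispatch on the source name; pass only the source parameters."""
    return SPLITTERS[loc.src_name](loc.src_params, part_idx, nparts)


def merge(loc, parts):
    """Dispatch on the destination name; pass only the destination parameters."""
    return MERGERS[loc.dst_name](loc.dst_params, parts)
```

Keeping source and destination parameters separate means a splitting function never sees destination settings and a merging function never sees source settings, which mirrors the split described above.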

Samples

When a location is constructed, two things are collected: information about the location (its names and parameters) and a sample of the data stored at the location, if any is stored.

Caching Locations and Samples

Caching refers to the process of storing a result in a cache so that it doesn't have to be recomputed. A cached item may be invalidated if something has changed that requires the item to be recomputed.

In Banyan Julia, locations and samples are cached in the user's local file system. (In this context, "location" refers to just the non-sample part of the location.) In the future, we will cache items in the S3 bucket for the cluster where the session is running so that different users share the same cache.
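The idea of a file-system cache with invalidation can be sketched as follows. This is a generic illustration in Python, not Banyan's cache: the `cached_sample` function, the JSON cache layout, and the choice of file size plus modification time as the invalidation fingerprint are all assumptions.

```python
import hashlib
import json
import os
import tempfile


def cached_sample(path, cache_dir, compute_sample):
    """Return (sample, was_cache_hit), recomputing only if the file changed."""
    os.makedirs(cache_dir, exist_ok=True)
    stat = os.stat(path)
    # Assumed fingerprint: a change in size or modification time invalidates the entry.
    fingerprint = [stat.st_size, stat.st_mtime_ns]
    # Key the cache file on the absolute path so different files never collide.
    key = hashlib.sha256(os.path.abspath(path).encode()).hexdigest()
    cache_path = os.path.join(cache_dir, key + ".json")

    if os.path.exists(cache_path):
        with open(cache_path) as f:
            entry = json.load(f)
        if entry["fingerprint"] == fingerprint:
            return entry["sample"], True  # valid cached item

    # Cache miss or invalidated item: recompute and overwrite the entry.
    sample = compute_sample(path)
    with open(cache_path, "w") as f:
        json.dump({"fingerprint": fingerprint, "sample": sample}, f)
    return sample, False


def first_lines(path, n=2):
    """Toy sampling function: take the first n lines of a text file."""
    with open(path) as f:
        return f.read().splitlines()[:n]


def demo():
    """Sample a file twice, change the file, then sample it again."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "data.txt")
        cache = os.path.join(tmp, "cache")
        with open(data, "w") as f:
            f.write("a\nb\nc")
        first = cached_sample(data, cache, first_lines)
        second = cached_sample(data, cache, first_lines)
        with open(data, "w") as f:
            f.write("x\ny\nz\nw")  # the size changes, so the fingerprint changes
        third = cached_sample(data, cache, first_lines)
    return first, second, third
```

In `demo`, the first call computes the sample, the second is served from the cache, and the third recomputes because the underlying file was modified.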