Defining Locations
Julia-only
This feature is currently only supported in Julia. Python support is coming soon.
Every future tracked by Banyan has a location. Every split or merge performed
by a partitioning function either splits from or merges to a future's
location, or uses the Memory, Disk, or None location.
A data location specifies a source and/or destination for data. Partitioning
functions are dynamically dispatched based on the location of data, and the
location of a future can be specified through the sourced and destined
functions. These are useful when annotating functions such as read_hdf5,
where the result must be sourced from a location constructed from a path
to an HDF5 dataset.
To define a new location, you must provide the following:
- A location constructor
- A partitioning function (PF) for reading from the location
- A PF for writing to the location
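The three pieces listed above can be sketched in plain Python. This is only an illustrative model, not the Banyan API: the `Location` class, the `LinesFile` constructor, the `read_lines` and `write_lines` functions, and their signatures are all hypothetical names chosen for this example, and the "location" is simply a newline-delimited text file.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class Location:
    # Hypothetical stand-in for a location; not the real Banyan type.
    src_name: str
    dst_name: str
    src_parameters: dict = field(default_factory=dict)
    dst_parameters: dict = field(default_factory=dict)


def LinesFile(path):
    """Location constructor: describe a newline-delimited text file."""
    nlines = 0
    if os.path.exists(path):
        with open(path) as f:
            nlines = sum(1 for _ in f)
    return Location(
        src_name="LinesFile",
        dst_name="LinesFile",
        src_parameters={"path": path, "nlines": nlines},
        dst_parameters={"path": path},
    )


def read_lines(src_parameters, part_idx, nparts):
    """PF for reading: return the lines owned by partition part_idx."""
    n = src_parameters["nlines"]
    start = part_idx * n // nparts
    stop = (part_idx + 1) * n // nparts
    with open(src_parameters["path"]) as f:
        lines = [line.rstrip("\n") for line in f]
    return lines[start:stop]


def write_lines(dst_parameters, parts):
    """PF for writing: merge the partitions into one file, in order."""
    with open(dst_parameters["path"], "w") as f:
        for part in parts:
            for line in part:
                f.write(line + "\n")


def demo():
    with tempfile.TemporaryDirectory() as d:
        src_path = os.path.join(d, "in.txt")
        with open(src_path, "w") as f:
            f.write("a\nb\nc\nd\ne\n")
        src = LinesFile(src_path)
        # Split the file into two partitions, then merge them elsewhere.
        parts = [read_lines(src.src_parameters, i, 2) for i in range(2)]
        dst = LinesFile(os.path.join(d, "out.txt"))
        write_lines(dst.dst_parameters, parts)
        with open(dst.dst_parameters["path"]) as f:
            return parts, f.read()
```

In this sketch the constructor records everything the reading PF needs (the path and the line count), so the PF itself only has to work out which slice belongs to its partition.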
You can then either use the location yourself or contribute it directly to
banyan-julia (or banyan-python):
- Fork and clone banyan-julia (or banyan-python)
- Create a new package (e.g., BanyanParquet.jl) containing the location constructor, PFs, and reading/writing functions
- Create a pull request (PR) on the banyan-julia (or banyan-python) GitHub repository
WIP
Please note that the documentation on Extending Banyan is currently a work in progress.
What are Locations?
Each future not only represents some data that is yet to be computed but also stores information about the location of the data:
- The source (if any) that the data should be split from
- The destination (if any) that the data should be merged to
- A sample of data to split from
In Banyan.jl, you can construct locations with functions like Value, Size,
Client, and Remote. These locations can be assigned to futures with sourced
and destined. Once locations are assigned to futures, samples are taken for
source locations, and these samples are used to estimate various properties
of the data, which helps to construct accurate partition types.
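As a rough illustration of assigning locations to futures, the sketch below models a future whose location has a source half and a destination half. It is written in plain Python and is not the Banyan.jl implementation: the `Location` and `Future` classes, the field names, the example paths, and the assumption that `sourced` leaves the destination untouched (and `destined` leaves the source untouched) are all made up for this example.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


@dataclass(frozen=True)
class Location:
    # Hypothetical model of a location: a source half, a destination half,
    # and a sample that belongs to the source.
    src_name: str = "None"
    dst_name: str = "None"
    src_parameters: Optional[dict] = None
    dst_parameters: Optional[dict] = None
    sample: Any = None


@dataclass
class Future:
    # A future starts out with the "None" location on both sides.
    location: Location = field(default_factory=Location)


def sourced(fut, loc):
    """Assign only the source half (and its sample) of loc to the future."""
    fut.location = replace(
        fut.location,
        src_name=loc.src_name,
        src_parameters=loc.src_parameters,
        sample=loc.sample,
    )
    return fut


def destined(fut, loc):
    """Assign only the destination half of loc to the future."""
    fut.location = replace(
        fut.location,
        dst_name=loc.dst_name,
        dst_parameters=loc.dst_parameters,
    )
    return fut


# Usage: read from one (hypothetical) remote path and write to another.
fut = Future()
sourced(fut, Location(src_name="Remote",
                      src_parameters={"path": "data/in.h5"},
                      sample=[1, 2, 3]))
destined(fut, Location(dst_name="Remote",
                       dst_parameters={"path": "data/out.h5"}))
```

Keeping the two halves independent lets the same future be split from one place and merged to another.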
Constructors
Locations can be constructed from a Banyan client library (such as Banyan.jl for Banyan Julia). A constructed location contains several important things:
- Name of this location as a source (something that can be read from)
- Name of this location as a destination (something that can be written to)
- Parameters for usage as a source
- Parameters for usage as a destination
- A sample
The names and parameters are passed to the splitting and merging functions dispatched for the future that has this location: the source name and parameters go to splitting functions, and the destination name and parameters go to merging functions.
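One way to picture this dispatch is a pair of registries keyed by location name. The sketch below is plain Python under that assumption; the registries, the `splitter` and `merger` decorators, the `ListInMemory` location name, and the function signatures are hypothetical and do not come from Banyan.

```python
SPLITTERS = {}  # source name -> splitting function
MERGERS = {}    # destination name -> merging function


def splitter(src_name):
    """Register a splitting function for a source name."""
    def register(fn):
        SPLITTERS[src_name] = fn
        return fn
    return register


def merger(dst_name):
    """Register a merging function for a destination name."""
    def register(fn):
        MERGERS[dst_name] = fn
        return fn
    return register


def split(location, part_idx, nparts):
    # Dispatch on the source name; pass only the source parameters.
    fn = SPLITTERS[location["src_name"]]
    return fn(location["src_parameters"], part_idx, nparts)


def merge(location, parts):
    # Dispatch on the destination name; pass only the destination parameters.
    fn = MERGERS[location["dst_name"]]
    return fn(location["dst_parameters"], parts)


@splitter("ListInMemory")
def split_list(src_parameters, part_idx, nparts):
    items = src_parameters["items"]
    n = len(items)
    return items[part_idx * n // nparts:(part_idx + 1) * n // nparts]


@merger("ListInMemory")
def merge_list(dst_parameters, parts):
    for part in parts:
        dst_parameters["out"].extend(part)
    return dst_parameters["out"]


location = {
    "src_name": "ListInMemory",
    "src_parameters": {"items": list(range(7))},
    "dst_name": "ListInMemory",
    "dst_parameters": {"out": []},
}
parts = [split(location, i, 3) for i in range(3)]
merged = merge(location, parts)
```

Because the lookup uses only the name stored in the location, supporting a new location in this model means registering one more splitter and one more merger.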
Samples
When a location is constructed, information about it is collected through its names and parameters. In addition, a sample of the data stored at the location is collected, if any data is stored there.
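A minimal sketch of what collecting a sample might involve, assuming the data is a sequence of text rows: keep a bounded random subset together with the sampling rate, then scale a statistic measured on the sample up to the full data. The function names, the `rate` field, and the row-based data model are assumptions made for illustration, not Banyan's sampling logic.

```python
import random


def collect_sample(rows, max_sample_rows=100, seed=0):
    """Return a bounded random sample of rows, or None if there is no data."""
    rows = list(rows)
    if not rows:
        return None
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    k = min(max_sample_rows, len(rows))
    return {
        "rows": rng.sample(rows, k),
        # How many rows of real data each sampled row stands for.
        "rate": len(rows) / k,
    }


def estimate_total_bytes(sample):
    """Estimate the size of the full data by scaling up the sample's size."""
    sample_bytes = sum(len(row.encode("utf-8")) for row in sample["rows"])
    return int(sample_bytes * sample["rate"])


sample = collect_sample(["abcd"] * 1000)
estimated = estimate_total_bytes(sample)
```

An estimate like this is the kind of property that could inform how many partitions the data should be split into.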
Caching Locations and Samples
Caching refers to the process of storing a result in a cache so that it doesn't have to be recomputed. A cached item may be invalidated if something has changed that requires the item to be recomputed.
In Banyan Julia, locations (here meaning just the non-sample part of a location) and samples are cached in the user's local file system. In the future, items will be cached in the S3 bucket of the cluster where the session is running so that different users share the same cache.
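The sketch below illustrates the general idea of a local file-system cache with invalidation. It is not how Banyan stores its cache: the `cached_location` function, the JSON file layout, the hashing of the path into a file name, and the choice of the file's modification time as the invalidation signal are all assumptions for this example.

```python
import hashlib
import json
import os
import tempfile


def cached_location(path, cache_dir, compute):
    """Return (value, was_cached) for the data at path.

    The cached entry is reused only while the file's modification time
    matches the one recorded when the entry was written.
    """
    key = hashlib.sha256(path.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".json")
    mtime = os.path.getmtime(path)
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            entry = json.load(f)
        if entry["mtime"] == mtime:
            return entry["value"], True
    # Cache miss or invalidated entry: recompute and overwrite.
    value = compute(path)
    with open(cache_file, "w") as f:
        json.dump({"mtime": mtime, "value": value}, f)
    return value, False


def demo():
    calls = []

    def count_lines(path):
        calls.append(path)
        with open(path) as f:
            return {"nlines": sum(1 for _ in f)}

    with tempfile.TemporaryDirectory() as d:
        data = os.path.join(d, "data.txt")
        cache_dir = os.path.join(d, "cache")
        os.makedirs(cache_dir)
        with open(data, "w") as f:
            f.write("a\nb\n")
        first = cached_location(data, cache_dir, count_lines)
        second = cached_location(data, cache_dir, count_lines)
        # Change the data and force a visibly different modification time.
        with open(data, "a") as f:
            f.write("c\n")
        m = os.path.getmtime(data)
        os.utime(data, (m + 10, m + 10))
        third = cached_location(data, cache_dir, count_lines)
    return first, second, third, len(calls)
```

In `demo`, the second lookup is served from the cache, while the third recomputes because the underlying file changed.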