Defining Partitioning Functions
Julia-only
This feature is currently only supported in Julia. Python support coming soon.
Annotated code regions are lazily scheduled and executed by Banyan behind the scenes. The Banyan scheduler groups code regions into stages where each stage is a set of code regions that can share the same partition types for all the data they process. The scheduler also generates code to first split data across workers and loop iterations at the start of the stage and then to merge data back together at the end of the stage.
But how does the scheduler know how to split and merge data? It does so by dynamically dispatching partitioning functions based on the partition types of the data.
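To make this concrete, here is a minimal sketch of dispatch on partition types. This is not Banyan's actual internals; the type and function names are hypothetical. Modeling partition types as Julia types lets multiple dispatch choose the right splitting implementation:

```julia
# Illustrative sketch, not Banyan's internals: partition types modeled as
# Julia types so the right partitioning function is chosen by dispatch.

abstract type PartitionType end
struct Blocked <: PartitionType end   # contiguous, evenly sized blocks
struct Grouped <: PartitionType end   # elements assigned by a hash of their value

# Split `data` across `nworkers` workers, dispatching on the partition type.
split_across(data::Vector, ::Blocked, nworkers::Int) =
    [data[div((i - 1) * length(data), nworkers) + 1 : div(i * length(data), nworkers)]
     for i in 1:nworkers]

split_across(data::Vector, ::Grouped, nworkers::Int) =
    [[x for x in data if mod(hash(x), nworkers) + 1 == i] for i in 1:nworkers]

# split_across(collect(1:10), Blocked(), 2) yields two contiguous halves.
```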
WIP
Please note that the documentation on Extending Banyan is currently a work in progress.
What are Partitioning Functions?
There are three kinds of partitioning functions:
- Splitting functions split data across workers and/or loop iterations
- Merging functions merge data that was previously split
- Casting functions convert data split across workers from one partition type to another
To create a partitioning function, two things are required:
- A function that implements the splitting/merging/casting
- A description of the requirements on the partition type of any data to be split/merged/cast with this function
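A minimal sketch of these two pieces, with hypothetical names throughout (Banyan's real registration API differs; a plain Dict stands in for it here):

```julia
# Hypothetical sketch of the two required pieces. The registry and the
# requirement description are illustrative, not Banyan's real API.

# 1. A function that implements the splitting.
function even_split(data, worker_idx::Int, nworkers::Int)
    n = length(data)
    lo = div((worker_idx - 1) * n, nworkers) + 1
    hi = div(worker_idx * n, nworkers)
    data[lo:hi]
end

# 2. A description of the partition type required of any data
#    that this function may split.
const PARTITIONING_FUNCTIONS = Dict(
    (kind = :splitting, required_pt = "Blocked{balanced}") => even_split,
)
```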
Why Write Partitioning Functions?
Partitioning functions are dynamically dispatched based on the partition types of data. This dynamic dispatch allows splitting/merging/casting functions to specialize their implementation for the particular partition type that they require or the location of the data being split/merged.
For example, data with a partition type of Blocked{balanced} might be split with a splitting function SplitBlock that doesn't require iterating through the data being split and is therefore much less computationally intensive than a function like SplitGroup.
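The cost difference can be sketched as follows. The function bodies here are hypothetical (only the names SplitBlock and SplitGroup come from the text): splitting balanced blocked data needs only index arithmetic, while grouped splitting must touch every element.

```julia
# Illustrative contrast; implementations are hypothetical.

# Index arithmetic only: compute slice bounds and return a view,
# never iterating over the elements being split.
function splitblock(data::Vector, worker_idx::Int, nworkers::Int)
    n = length(data)
    view(data, div((worker_idx - 1) * n, nworkers) + 1 : div(worker_idx * n, nworkers))
end

# O(n): every element must be hashed to find its destination worker.
function splitgroup(data::Vector, worker_idx::Int, nworkers::Int)
    [x for x in data if mod(hash(x), nworkers) + 1 == worker_idx]
end
```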
As another example, data with a location of HDF5 might be split by a splitting function ReadHDF5 that leverages Parallel HDF5 for platform-optimized reading and writing of HDF5 datasets parallelized across workers.
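As a simplified sketch of the idea (assuming the HDF5.jl package; the file path, dataset name, and function name are hypothetical, and a real ReadHDF5 would use Parallel HDF5/MPI rather than independent reads), each worker can read only its own contiguous slice of a dataset:

```julia
# Hypothetical sketch: each worker reads only its slice of a 1-D HDF5 dataset.
using HDF5

function read_my_slice(path::String, dset_name::String,
                       worker_idx::Int, nworkers::Int)
    h5open(path, "r") do f
        dset = f[dset_name]
        n = length(dset)
        lo = div((worker_idx - 1) * n, nworkers) + 1
        hi = div(worker_idx * n, nworkers)
        dset[lo:hi]   # only this slice is read from disk
    end
end
```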