Annotating Code

Julia-only

This feature is currently only supported in Julia. Python support coming soon.

Before proceeding with this section on extending Banyan, please note that doing so is completely optional: you can use Banyan directly with libraries that have already been annotated, such as BanyanArrays.jl.

The primary way of extending Banyan is by annotating more code so that data is partitioned and code is automatically scaled to any data size on any cluster.

Banyan can be used to annotate any code, from custom scripts to functions in software libraries. For example, BanyanArrays.jl is the result of annotating functions in the Julia standard library for manipulating multi-dimensional arrays. The cool thing is that you can directly use BanyanArrays.jl without knowing anything about how those functions were annotated.

WIP

Please note that the documentation on Extending Banyan is currently a work in progress.

Concepts

Code Regions

The first step of annotating code with Banyan is identifying the specific regions of code that you want to offload and run at scale. In Banyan.jl, each code region to be annotated must be wrapped in a call to the @partitioned macro (we call it that because annotating code is essentially a matter of describing how the data used inside it is partitioned).

# The original code
res = Base.map(f, c...)

# The same code wrapped in @partitioned
@partitioned f c res begin
    res = Base.map(f, c...)
end

Partition Types

What are Partition Types?

Annotating code with Banyan is simply a matter of annotating variables with partition types. Partition types can be thought of as analogous to data types: where a data type describes the layout of data, a partition type describes how the data is partitioned.

# A variable
flow_z

# A data type describing the precision/layout of each
# element and shape of `flow_z`
Array{Float64,3}

# A partition type describing the way that `flow_z` was
# partitioned across nodes and loop iterations. This
# partition type says that `flow_z` is partitioned with
# equally sized (balanced) blocks along the first dimension
Distributing{blocked,balanced,1}

How are Partition Types Useful?

When multiple code regions use the same data with non-conflicting partition types, we can perform an exciting optimization sometimes called "fusion". Essentially, we split the data up across the nodes in a cluster, split it up again across loop iterations, and then execute all of the code regions one after another on each small piece of data in each loop iteration.

By using an advanced SMT-solver-based scheduler behind the scenes, we can ensure that data is split up in such a way that everything stays in memory and, ideally, in a CPU (L2) cache. This can lead to dramatic performance gains, since accessing data in a CPU cache is much faster than accessing main memory, and far faster than reading from disk.
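
To make the idea concrete, here is a minimal sketch (in plain Julia, not using Banyan) of what fusion amounts to; the chunk size and the two regions are made up for illustration:

# Two code regions that use the same data with non-conflicting
# (blocked, balanced) partition types
region_a(x) = x .+ 1
region_b(x) = x .* 2

data = rand(1_000_000)

# Without fusion: each region makes a full pass over the data, so
# the intermediate result falls out of cache between passes
intermediate = region_a(data)
result_unfused = region_b(intermediate)

# With fusion: split the data into small chunks (one per node and
# loop iteration) and run both regions back-to-back on each chunk
# while it is still hot in cache
chunk_size = 4096  # hypothetical size chosen to fit in L2 cache
result_fused = similar(data)
for start in 1:chunk_size:length(data)
    stop = min(start + chunk_size - 1, length(data))
    piece = view(data, start:stop)
    result_fused[start:stop] = region_b(region_a(piece))
end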

Parameters

To make things more concrete, we represent a partition type as a dictionary that maps parameter names to their values. Two partition types are non-conflicting if no parameter name is assigned differing values by the two types.
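
For illustration, the Distributing partition type shown earlier might be written out as such a dictionary; the parameter names here are illustrative, not necessarily the ones Banyan.jl uses internally:

# The partition type Distributing{blocked,balanced,1}, written out
# as a mapping from parameter names to values (names are illustrative)
pt_a = Dict("name" => "Distributing", "distribution" => "blocked",
            "balanced" => true, "key" => 1)

# A second partition type that only constrains the distribution
pt_b = Dict("name" => "Distributing", "distribution" => "blocked")

# Non-conflicting: no shared parameter name has differing values
all(pt_a[k] == v for (k, v) in pt_b if haskey(pt_a, k))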

Constructors

In some cases, deriving a partition type is non-trivial. For example, one may wish to partition a data frame in such a way that each partition has a unique range of values for a certain column. This partitioning is useful for operations such as sorting by that column or group-by-aggregating on it: once each partition has a unique range for the column, the sort or aggregation can simply be applied to each partition to achieve the desired result across all partitions.

To describe this partitioning, we might introduce a partition type called Grouped which is parameterized on the column used for grouping. But the column to group by is not by itself sufficient for us to split and merge data that has been assigned a Grouped partition type. We also need to know the ranges of column values to group data into, and computing these ranges requires computing quantiles of, perhaps, a sample of the whole data.

Because some partition type parameters (such as the ranges/quantiles for our Grouped partition type) need to be computed, it is often useful to develop constructors for partition types. Indeed, we have developed several constructors to easily construct a variety of partition types in Banyan.jl.
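
As a rough illustration, a constructor for the Grouped partition type described above might look like the following sketch. The multi-parameter PartitionType call, the parameter names, and the quantile-based computation of ranges are all assumptions for illustration, not Banyan.jl's actual implementation:

using Statistics

# A hypothetical constructor that computes the "divisions" parameter
# (the boundaries between per-partition value ranges) from a sample
function Grouped(column::Symbol, sample_values::Vector, npartitions::Int)
    # Quantiles of the sample approximate quantiles of the full data
    divisions = quantile(sample_values, range(0, 1, length=npartitions + 1))
    PartitionType("name" => "Grouped", "key" => column,
                  "divisions" => divisions)
end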

Bangs

A partition type parameter can have a value of !, which we call a bang. When a value is a bang, it is automatically replaced by some unique value. This is particularly useful, for example, when you have a Distributed partition type with an id parameter that represents a unique distribution of data across workers and loop iterations. Having an id parameter allows us to ensure that when we, for instance, want to select a subset of a data frame with an array of booleans, we can - as part of the annotation for the subsetting function - apply a partitioning constraint requiring that the data frame and the array have partition types with the same id.

Now, note that some operations, such as filtering, produce an entirely new distribution of data (and so we can apply the Drifted() = PartitionType("id" => "!") constructor), while other operations, such as a simple mapping transformation, preserve the existing distribution of data (expressed with a Match partitioning constraint).
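
Continuing the dictionary view of partition types from above, here is a minimal sketch of what the id parameter buys us; the id strings are made up for illustration:

# Conceptually, the bang "!" is replaced with a fresh unique id
pt_input = Dict("name" => "Distributed", "id" => "abc123")

# After a filter, the result has a brand-new id (it has "drifted")
pt_filtered = Dict("name" => "Distributed", "id" => "xyz789")

# After a map, the result matches its input's id exactly
pt_mapped = Dict("name" => "Distributed", "id" => "abc123")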

Partition type annotations are used to dispatch splitting functions, and these splitting functions also have partition types that define the required partitioning of any data the function is used to split. These partition types may also include bangs, indicating that the result of the splitting function is an entirely new distribution of data across workers and loop iterations.

Partition Annotation

Assignment

The main aspect of annotating code is actually assigning partition types to the variables used in a code region. In Banyan.jl, this is done using the partitioned_using and partitioned_with functions.
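
A sketch of what assignment might look like is below. The pt helper and the Blocked/Replicated constructors are illustrative assumptions modeled on BanyanArrays.jl-style annotations, not an exact API reference:

# A sketch of assigning partition types inside partitioned_with
# (helper names and signatures are illustrative)
partitioned_with() do
    # The input can be distributed in balanced blocks along the
    # first dimension, or fully replicated
    pt(c, Blocked(c, along=1) | Replicated())
    # The result is partitioned the same way as the input
    pt(res, Blocked(res, along=1) | Replicated(), match=c)
end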

Constraints

It is sometimes necessary to place additional constraints on the partitioning of data to ensure correctness. For example, it might be important that the number of partitions of a data frame being split is no more than the maximum number of roughly equal-sized groups of data, where each group has a unique range of values for some column.

There are two kinds of constraints that can be enforced:

  1. Partition annotation (PA) constraints, which must be followed by an annotation for a code region
  2. Partition type (PT) constraints, which apply to a partition type assigned to a value. Breaking such a constraint requires merging the value and re-splitting it with a different partition type.

Co, Cross, Equal, Sequential, Match, MatchOn, and AtMost are PA-level constraints. AtMost and ScaleBy are PT-level constraints. For all constraints, contradictions result in the PA not being used. The exception is ScaleBy: multiple PAs can specify a ScaleBy for the same PT, and these are not fused in any way; instead, each such constraint is taken as a possible scaling of the PT. Of course, the PT parameters specified by the different PAs still have to match.
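
As an illustration, a PA-level constraint such as AtMost might be applied alongside partition type assignment. The pc helper, the AtMost signature, and the Grouped constructor (sketched earlier) are all assumptions for illustration:

partitioned_with() do
    # The data frame is grouped into value ranges on the :price column
    pt(df, Grouped(:price, sample_values, 16))
    # Hypothetical constraint: no more than 16 partitions, since there
    # are only 16 distinct value ranges to group the data into
    pc(AtMost(16, df))
end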

Cascading

For each code region, you can provide multiple possible PAs. This is important because there are often multiple valid ways to partition the same data.

Steps

There are several steps to annotating a function (code region) to describe how data that flows through the code can be partitioned.

Create Futures for results

There are several ways to construct a Future.
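
For example, assuming a zero-argument Future() constructor for a not-yet-computed result and a one-argument constructor that wraps an existing value (treat both signatures as assumptions), creating futures might look like this:

# A future for a result that does not exist yet
res = Future()

# Futures wrapping existing values so they can be sent to the cluster
f_fut = Future(f)
c_fut = Future(c)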

partitioned_with

This is where you actually construct partition types and describe how the data can be partitioned (see the sketch under Assignment above).

@partitioned

This is how you mark the code region being annotated and specify the variables used in it.
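
Putting the steps together for the map example from the top of this page (simplified to a single collection), an annotated function might look roughly like the following. The Future constructors and the pt/Blocked/Replicated helpers are the same illustrative assumptions used above, so read this as a sketch of the overall shape rather than a verbatim BanyanArrays.jl implementation:

function banyan_map(f, c)
    # Step 1: create futures for the inputs and the result
    f = Future(f)
    c = Future(c)
    res = Future()

    # Step 2: describe how the data can be partitioned
    partitioned_with() do
        # f is a small function, so it is replicated everywhere
        pt(f, Replicated())
        # The input can be split into balanced blocks or replicated
        pt(c, Blocked(c, along=1) | Replicated())
        # The result is partitioned exactly like the input
        pt(res, Blocked(res, along=1) | Replicated(), match=c)
    end

    # Step 3: mark the code region and the variables it uses
    @partitioned f c res begin
        res = Base.map(f, c)
    end

    res
end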