BanyanArrays.jl

BanyanArrays.jl is a scalable tool for processing large n-dimensional arrays in the cloud. It currently supports map-reduce computation on multi-dimensional arrays that can be read from and written to HDF5 datasets on the Internet (e.g., hosted on GitHub) or in Amazon S3. The API generally matches the Julia standard library's Array API, which Julia users are already familiar with.

Getting Started

To get started with BanyanArrays.jl, follow the steps here to set up Banyan.jl.

Then, open the Julia REPL and press ] to enter "Pkg" (package) mode and run add BanyanArrays. (Ensure you have also added Banyan first.)

Finally, exit package mode (press Backspace) and start a session.

using Banyan, BanyanArrays

start_session(
    cluster_name="parameter-tuning",
    nworkers=128,
    session_name="Resource-Allocation-Model-v2",
    email_when_ready=true
)

Awesome! You can now use the functions described below for massively parallel data processing in this session of 128 workers.

`Array{T,N}`

BanyanArrays.jl provides BanyanArrays.Array{T,N}, which aims to be API-compatible with the Julia standard library's Array{T,N}. BanyanArrays.BanyanArray{T,N} is a subtype of AbstractFuture, so functions that operate on futures can be applied to a BanyanArray.

Creating Arrays

Just like standard library arrays, Banyan arrays can be created with fill, zeros, ones, trues, and falses.

You can also convert a Base array into a Banyan array with convert(Banyan.Array, A::AbstractArray) for any Base (non-Banyan) array A.
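Since the Banyan constructors are described as mirroring the standard library, the Base calls below give a rough picture of the shapes and element types involved. This is only a sketch that runs without a cluster: it assumes the Banyan versions accept the same arguments, and the convert call is left as a comment because it needs a running session.

```julia
# Base (non-Banyan) constructors; the Banyan versions are assumed
# to take the same arguments.
a = fill(2.5, (3, 4))   # 3×4 Matrix{Float64}, every element is 2.5
z = zeros(Float64, 5)   # 5-element Vector{Float64} of 0.0
o = ones(Int, 2, 2)     # 2×2 Matrix{Int} of 1
t = trues(3)            # 3-element BitVector, all true
f = falses(2, 3)        # 2×3 BitMatrix, all false

# With a running session, a Base array could then be handed to Banyan
# as described above (not executed here):
# banyan_a = convert(Banyan.Array, a)
```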

Reading/Writing Arrays

Arrays can be read from and written to HDF5 datasets with read_hdf5 and write_hdf5. See here for more information.

Array Properties

Just like standard library arrays, various array properties can be accessed with ndims, eltype, length, and size.
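As a quick reminder of what these four functions report, here is how they behave on a Base array. The sketch assumes that a Banyan array of the same shape and element type would report the same values; it does not touch a cluster.

```julia
# A 3×4×5 array of Float32 values (Base array, used only for illustration)
A = zeros(Float32, 3, 4, 5)

n_dims = ndims(A)    # number of dimensions: 3
el     = eltype(A)   # element type: Float32
len    = length(A)   # total number of elements: 60
dims   = size(A)     # extent along each dimension: (3, 4, 5)
```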

Map-Reduce Computation

We support both the map and reduce functions. Note that both map and reduce return futures that can be computed. The unary - operator and the binary + and - operators are also supported.

The map function accepts a force_parallelism keyword argument (default false). Setting it to true forces the computation to run in parallel even if the array is very small. This is useful when you have little data but an expensive computation on each element (e.g., parameter tuning, ML).

We also support several common aggregations with sum, minimum, and maximum.

Arrays can be copied with copy and deepcopy.

Currently, only numerical reductions are supported. This restriction will be lifted once our Reduce merging function is changed to use a variable-sized version.
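The operations in this section can be tried on a small Base vector first. The sketch below uses only standard library calls; with Banyan arrays the same calls are expected to return futures, so each result would additionally need compute to obtain its value. The variable names are illustrative.

```julia
readings = [3.0, -1.5, 4.0, 0.5]

squared = map(x -> x^2, readings)   # element-wise map: [9.0, 2.25, 16.0, 0.25]
total   = reduce(+, squared)        # numerical reduction: 27.5

negated = -readings                 # unary minus
diff    = readings - negated        # binary minus: twice each reading
summed  = readings + negated        # binary plus: all zeros

# Common aggregations
s  = sum(readings)                  # 6.0
lo = minimum(readings)              # -1.5
hi = maximum(readings)              # 4.0

# copy gives an independent array
c = copy(readings)
c[1] = 100.0                        # does not change readings[1]
```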

Custom Data Types and Functions

A Banyan array can store data of any type - even custom types!

You may define a structure and a function in a separate file:

# user.jl

struct User
    name::String
    age::Integer
end

lifetime_value(u::User) = u.age > 30 ? 1.0 : 0.0

You can then use these in your session:

include("user.jl")

function main()
    start_session(
        cluster_name="ltv-computation",
        nworkers=128,
        session_name="Simple-LTV-Experiment",
        email_when_ready=true,
        code_files=["file://user.jl"],
        force_update_files=true
    )

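    # read_user is not defined in this snippet; it is assumed to be your own
    # helper that parses one file into a User
    local_users::Base.Vector{User} = [
        read_user(userfile)
        for userfile in readdir("users")
    ]
    # A variable can only be given one declared type, so use a separate name
    users::Banyan.Vector{User} = convert(Banyan.Vector, local_users)

    # A distinct name avoids shadowing the lifetime_value function inside main
    max_lifetime_value = compute(maximum(map(lifetime_value, users)))

    @show max_lifetime_value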
end

main()

Note that if you are writing to HDF5, you are restricted to simple numeric data types.
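Before starting a session, it can help to check a custom type and function locally. The sketch below repeats the User struct and lifetime_value logic from user.jl and applies the same map and maximum pipeline to a Base vector. The sample users are made up for illustration, and nothing here runs on a cluster.

```julia
# Same definitions as in user.jl
struct User
    name::String
    age::Integer
end

lifetime_value(u::User) = u.age > 30 ? 1.0 : 0.0

# Hypothetical sample data standing in for the files under "users"
sample_users = [User("Ada", 36), User("Grace", 28), User("Linus", 52)]

ltvs    = map(lifetime_value, sample_users)   # [1.0, 0.0, 1.0]
max_ltv = maximum(ltvs)                       # 1.0
```

With a Banyan vector the final value would come from compute(maximum(map(lifetime_value, users))), as in the session example above.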

Example Usage

Once you have created a cluster and are set up with S3, running scalable array computation with BanyanArrays.jl is as simple as importing BanyanArrays.jl, starting a session, and calling the familiar standard library array functions listed above.

with_session(
    cluster_name="mycluster",
    session_name="Summary of Sensor Readings",
    nworkers=4,
) do s
    s4_distance = read_hdf5("s3://sensors/distance.hdf5/s4")
    s5_distance = read_hdf5("s3://sensors/distance.hdf5/s5")
    res = minimum(map((d4, d5) -> (d4-d5)^2, s4_distance, s5_distance))
    @show compute(res)
end

Note that Banyan arrays are futures, so you call compute to obtain their values. Read more about futures and how to compute them here.
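To see what the session above computes, the same expression can be evaluated on two small Base vectors. The numbers below are invented stand-ins for the s4 and s5 datasets; on the cluster, the result would be a future that compute turns into a value.

```julia
# Made-up stand-ins for the s4 and s5 distance readings
s4_local = [1.0, 2.0, 3.5]
s5_local = [1.5, 2.0, 1.5]

# Squared difference per pair of readings, then the smallest one
sq_diffs = map((d4, d5) -> (d4 - d5)^2, s4_local, s5_local)   # [0.25, 0.0, 4.0]
res_local = minimum(sq_diffs)                                  # 0.0
```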

A Word From Our Team of Banyaneers

Need a function that isn't listed here? Not sure how to implement your use case? Please email us at support@banyancomputing.com, contact us on the Banyan Users Slack, or create a GitHub issue so that we can meet your needs as soon as possible.