BanyanArrays.jl
BanyanArrays.jl is a scalable tool for processing large n-dimensional arrays in the cloud. It currently supports map-reduce computation on multi-dimensional arrays that can be read from and written to HDF5 datasets hosted on the Internet (e.g., on GitHub) or in Amazon S3. The API generally matches the Julia standard library's Array API, which Julia users are already familiar with.
Getting Started
To get started with BanyanArrays.jl, follow the steps here to set up Banyan.jl.
Then, open the Julia REPL, press ] to enter "Pkg" (package) mode, and run add BanyanArrays. (Ensure you have also added Banyan first.)
Finally, exit the package mode and start a session.
using Banyan, BanyanArrays
start_session(
    cluster_name="parameter-tuning",
    nworkers=128,
    session_name="Resource-Allocation-Model-v2",
    email_when_ready=true
)
Awesome! You can now use the functions described below for massively parallel data processing in this session of 128 workers.
Array{T,N}
BanyanArrays.jl provides BanyanArrays.Array{T,N}, which aims to be API-compatible with the Julia standard library's Array{T,N}. BanyanArrays.BanyanArray{T,N} is a subtype of AbstractFuture, so functions that operate on futures can be applied to a BanyanArray.
Creating Arrays
Just like standard library arrays, Banyan arrays can be created with fill, zeros, ones, trues, and falses.
You can also convert a Base array into a Banyan array with convert(Banyan.Array, A::AbstractArray) for any Base (non-Banyan) array A.
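As a point of reference, here is how these constructors behave on plain Base arrays. This is only a local sketch: it runs without Banyan, and it assumes (rather than confirms) that the Banyan versions accept the same arguments inside a running session.

```julia
# Plain Julia (Base) constructors; no Banyan session is needed to run this.
A = fill(2.5, (2, 3))   # 2x3 Matrix{Float64} with every element set to 2.5
Z = zeros(Float64, 4)   # 4-element Vector{Float64} of 0.0
O = ones(Int, 2, 2)     # 2x2 Matrix{Int} of 1
T = trues(3)            # 3-element BitVector, all true
F = falses(2, 2)        # 2x2 BitMatrix, all false

@show size(A) eltype(Z) sum(O) all(T) any(F)
```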
Reading/Writing Arrays
Arrays can be read from and written to HDF5 datasets with read_hdf5 and write_hdf5. See here for more information.
Array Properties
Just like standard library arrays, various array properties can be accessed with ndims, eltype, length, and size.
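For readers less familiar with these accessors, the sketch below shows what each one returns for a Base array. It is a local illustration only; the Banyan versions are assumed to report the same properties for a Banyan array.

```julia
# A 3-dimensional Base array used purely for illustration.
A = zeros(Float32, 3, 4, 5)

n = ndims(A)    # number of dimensions: 3
t = eltype(A)   # element type: Float32
len = length(A) # total number of elements: 3 * 4 * 5 = 60
dims = size(A)  # extent along each dimension: (3, 4, 5)

@show n t len dims
```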
Map-Reduce Computation
We support both the map and reduce functions. Note that both map and reduce return futures that can be computed with compute. The unary - operator and the binary + and - operators are also supported.
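The following sketch shows the same operations on Base arrays, where results are available immediately. With Banyan arrays the calls would instead return futures, as noted above; the sample values here are made up for illustration.

```julia
# Made-up sample data in plain Base vectors.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 0.5, 0.5, 0.5]

squares = map(x -> x^2, xs)   # apply a function to each element
total = reduce(+, squares)    # combine all elements with a binary operator

negated = -xs                 # unary minus
shifted = xs + ys             # elementwise binary plus
delta = xs - ys               # elementwise binary minus

@show squares total negated shifted delta
```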
The map function accepts force_parallelism=false, which, if set to true, forces the computation to run in parallel even if the array is very small. This is useful for scenarios where you have small data but expensive computation on each element (e.g., parameter tuning, ML).
We also support several common aggregations with sum, minimum, and maximum.
Arrays can be copied with copy and deepcopy.
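As a quick refresher, this is how the aggregations and the two copy functions behave on Base arrays. It is a local sketch with made-up values; whether the shallow versus deep distinction carries over unchanged to distributed Banyan arrays is not stated here and should not be assumed.

```julia
# Aggregations over a plain Base vector of made-up readings.
readings = [3.5, -1.0, 7.25, 0.0]
total = sum(readings)
lowest = minimum(readings)
highest = maximum(readings)

# copy duplicates only the outer array; deepcopy also duplicates nested arrays.
nested = [[1, 2], [3, 4]]
shallow = copy(nested)
deep = deepcopy(nested)
nested[1][1] = 99   # mutate an inner array of the original

@show total lowest highest
@show shallow[1][1] deep[1][1]
```

In Base Julia, shallow shares its inner arrays with nested and therefore sees the mutation, while deep does not.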
Currently, only numerical reductions are supported. This restriction will be lifted once our Reduce merging function is changed to use a variable-sized version.
Custom Data Types and Functions
A Banyan array can store data of any type - even custom types!
You may define a structure and a function in a separate file:
# user.jl
struct User
    name::String
    age::Integer
end
lifetime_value(u::User) = u.age > 30 ? 1.0 : 0.0
You can then use these in your session:
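using Banyan, BanyanArrays
include("user.jl")
function main()
    start_session(
        cluster_name="ltv-computation",
        nworkers=128,
        session_name="Simple-LTV-Experiment",
        email_when_ready=true,
        code_files=["file://user.jl"],
        force_update_files=true
    )
    # read_user is assumed to be defined elsewhere and to return a User
    local_users::Base.Vector{User} = [
        read_user(userfile)
        for userfile in readdir("users")
    ]
    # Convert the Base vector into a Banyan vector
    users::Banyan.Vector{User} = convert(Banyan.Vector, local_users)
    # Use a distinct name so the lifetime_value function is not shadowed
    max_lifetime_value = compute(maximum(map(lifetime_value, users)))
    @show max_lifetime_value
end
main()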
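Before launching a session, it can help to check the custom type and function locally on a small Base vector. The sketch below does that with the User struct and lifetime_value function from user.jl; the sample users are made up for illustration and no Banyan calls are involved.

```julia
# Same definitions as in user.jl, repeated here so the sketch is self-contained.
struct User
    name::String
    age::Integer
end

lifetime_value(u::User) = u.age > 30 ? 1.0 : 0.0

# Hypothetical sample data, not taken from any real dataset.
sample_users = [User("Ada", 36), User("Grace", 28)]

# Same map-then-maximum pipeline as in the session, run on a Base vector.
max_ltv = maximum(map(lifetime_value, sample_users))
@show max_ltv
```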
Note that if you are writing to HDF5, you are restricted to simple numeric data types.
Example Usage
Once you have created a cluster and are set up with S3, running scalable array computation with BanyanArrays.jl is as simple as importing BanyanArrays.jl, starting a session, and calling the familiar standard library array functions listed above.
using Banyan, BanyanArrays
with_session(
    cluster_name="mycluster",
    session_name="Summary of Sensor Readings",
    nworkers=4,
) do s
    s4_distance = read_hdf5("s3://sensors/distance.hdf5/s4")
    s5_distance = read_hdf5("s3://sensors/distance.hdf5/s5")
    res = minimum(map((d4, d5) -> (d4-d5)^2, s4_distance, s5_distance))
    @show compute(res)
end
Note that BanyanArrays are futures, so they can be computed. Read more about futures and how to compute them here.
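To see what the map and minimum expression in the example computes, here is the same calculation on two small Base vectors. The values are made up and stand in for the HDF5 datasets; no session, S3 access, or compute call is needed because Base arrays are evaluated eagerly.

```julia
# Made-up stand-ins for the two sensor datasets.
s4_distance = [1.0, 2.0, 4.0]
s5_distance = [1.5, 2.0, 1.0]

# Squared difference per pair of readings, then the smallest of those values.
squared_diffs = map((d4, d5) -> (d4 - d5)^2, s4_distance, s5_distance)
res = minimum(squared_diffs)

@show squared_diffs res
```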
A Word From Our Team of Banyaneers
Need a function that isn't listed here? Not sure how to implement your use case? Please send us an email at support@banyancomputing.com, contact us on the Banyan Users Slack, or create a GitHub issue so that we can meet your needs as soon as possible.