Computing Futures
> **Julia-only:** This feature is currently only supported in Julia. Python support is coming soon.
Banyan is especially effective for data-intensive computation, where the ability to seamlessly scale to arbitrary data sizes is valuable. In any data-intensive computation, it is critical that data dependencies are maintained even after Banyan optimizes to maximally utilize a cluster's workers and CPU caches. To track data dependencies, all data processed by Banyan must be wrapped in a future.
Futures = "Data yet to be Computed"
A future stores an ID that represents some data that is yet to be computed (i.e., will be computed in the future; hence why we call them futures).
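As a sketch of what this looks like in practice (the `BanyanArrays.ones` constructor here is an assumption based on BanyanArrays.jl; check the API reference for the exact signature):

```julia
using BanyanArrays  # assumed package name, from BanyanArrays.jl

# Constructing an array through Banyan returns a future immediately;
# no computation happens yet. The future only stores an ID for data
# that will be computed later on the cluster.
myones = BanyanArrays.ones(Float64, 2048, 2048)
```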
Computing a Future
To actually run the computation a future represents, you can use one of several functions.
compute
`compute` is the most common function you will use to run a computation. It not only computes the result but also collects it back to the client machine where you are using Banyan (e.g., your laptop or an EC2 instance) and returns it as an actual Julia value. To collect a future, call `compute(future)`. For example, to compute the sum of your matrix of ones, you would call `mysum = sum(myones)` to get the sum stored in a future `mysum` (this does not actually perform the computation; it just records it for future computation). To actually print out the sum, you would first have to collect it with `println(compute(mysum))`.
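Putting those steps together (a minimal sketch; `BanyanArrays.ones` is an assumed constructor from BanyanArrays.jl):

```julia
using BanyanArrays  # assumed package name

# Build up a lazy computation: each step returns a future.
myones = BanyanArrays.ones(Float64, 2048, 2048)
mysum = sum(myones)  # still a future; nothing has run yet

# compute() runs the computation on the cluster and collects the
# result back to the client as a plain Julia value.
println(compute(mysum))
```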
Don't worry about intermediate future results. Collecting a future will compute everything needed to return the result on the client side.
Saving to disk can impact performance
Banyan may have to save data to disk (which can be very slow) if it doesn't know that a future will never be used again (i.e., the future doesn't get garbage collected). You can help Banyan know which futures are no longer needed in one of two ways:
- Avoid storing futures representing large datasets in variables.
- Mark futures as destroyed by passing `destroy=[whole_dataset, whole_dataset_filtered]` to `compute`.

An example of the latter technique:
```julia
data = read_csv(p)
filtered = filter(row -> row.trip_distance < 1.0, data)
grouped = groupby(filtered, :PULocationID)
combined = combine(
    grouped,
    :total_amount => mean_func,
    :tip_amount => mean_func,
    :trip_distance => mean_func
)

# Without the `destroy` parameter below, Banyan will save the futures
# stored in the above variables to disk and make this query take much
# longer to run.
trip_means = compute(combined, destroy=[data, filtered, grouped])
```
write_to_disk
Both Banyan arrays and Banyan data frames can be written to disk on the cluster where the offloaded code runs. While similar functions are frequently used in Dask and Spark for persisting or caching a dataset that will be used throughout a long computation, Banyan doesn't require you to use `write_to_disk` as an optimization. Banyan automatically figures out how to keep data in memory and in CPU cache as much as possible and writes to disk only when needed.
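A hedged sketch of how this might look (assuming `write_to_disk` takes a future, as described above; the path and the filtering step are illustrative only):

```julia
# Hypothetical dataset path, for illustration.
data = read_csv("s3://my-bucket/trips.csv")
filtered = filter(row -> row.trip_distance < 1.0, data)

# Persist the filtered future to disk on the cluster. Unlike Dask's
# persist() or Spark's cache(), this is not needed as an optimization;
# Banyan manages memory and disk automatically.
write_to_disk(filtered)
```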
write_hdf5, write_csv, write_parquet, write_arrow
BanyanArrays.jl and BanyanDataFrames.jl export functions that you can use for writing data to a particular location. See the API references for more information. When you write data with these functions, you pass in a future; the future is then computed and the result is written to the given location.
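For example (a sketch; the exact `write_parquet` signature is an assumption based on the function names above, and the S3 path is hypothetical):

```julia
# Reuse the grouped-aggregation future from the earlier example.
data = read_csv(p)
grouped = groupby(data, :PULocationID)
combined = combine(grouped, :total_amount => mean_func)

# Passing the future to a write function computes it and writes the
# result to the given location.
write_parquet(combined, "s3://my-bucket/trip_means.parquet")
```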
Samples
In Banyan, everything you compute can also be sampled. Samples are automatically maintained, and you can access a sample by calling `sample` on the future. For example, you can read in a data frame with `s = read_csv("s3://sales/2020/09")` and collect the sampled `DataFrame` with `sample(s)`. Find out more about sampling your data.