Computing Futures
> **Julia-only:** This feature is currently only supported in Julia. Python support is coming soon.
Banyan is especially effective for data-intensive computation, where the ability to seamlessly scale to arbitrary data sizes is valuable. In any data-intensive computation, it is critical that data dependencies are maintained even after Banyan optimizes to maximally utilize a cluster's workers and CPU caches. To track data dependencies, all data processed by Banyan must be wrapped in a future.
Futures = "Data yet to be Computed"
A future stores an ID that represents some data that is yet to be computed (i.e., will be computed in the future; hence why we call them futures).
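As a sketch of what this looks like in practice (the `BanyanArrays.ones` constructor here is an assumption based on BanyanArrays.jl; check the API reference for the exact signature):

```julia
using BanyanArrays  # assumed package name, from BanyanArrays.jl

# Constructing an array through Banyan returns a future immediately;
# no computation happens yet. The future only stores an ID for data
# that will be computed later on the cluster.
myones = BanyanArrays.ones(Float64, 2048, 2048)
```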
Computing a Future
To actually run the computation a future represents, you can use one of several functions.
compute
`compute` is the most common function you will use to run a computation. It not only computes the result but also collects it back to the client machine where you are using Banyan (e.g., your laptop or an EC2 instance) and returns it as an actual Julia value. To collect a future, call `compute(future)`. For example, to compute the sum of your matrix of ones, you would call `mysum = sum(myones)` to get the sum stored in a future `mysum` (this does not actually perform the computation; it just records it for future computation). To actually print out the sum, you would first have to collect it with `println(compute(mysum))`.
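Putting those steps together (a minimal sketch; `BanyanArrays.ones` is an assumed constructor from BanyanArrays.jl):

```julia
using BanyanArrays  # assumed package name

# Build up a lazy computation: each step returns a future.
myones = BanyanArrays.ones(Float64, 2048, 2048)
mysum = sum(myones)  # still a future; nothing has run yet

# compute() runs the computation on the cluster and collects the
# result back to the client as a plain Julia value.
println(compute(mysum))
```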
Don't worry about intermediate future results. Collecting a future will compute everything needed to return the result on the client side.
Saving to disk can impact performance
Banyan may have to save data to disk (which can be very slow) if it doesn't know that a future will never be used again (i.e., the future doesn't get garbage collected). You can help Banyan know which futures are no longer needed in one of two ways:
- Avoid storing futures representing large datasets in variables.
- Mark futures as destroyed by passing `destroy=[whole_dataset, whole_dataset_filtered]` to `compute`.

An example of the latter technique:
```julia
data = read_csv(p)
filtered = filter(row -> row.trip_distance < 1.0, data)
grouped = groupby(filtered, :PULocationID)
combined = combine(
    grouped,
    :total_amount => mean_func,
    :tip_amount => mean_func,
    :trip_distance => mean_func
)

# Without the `destroy` parameter below, Banyan will save the futures
# stored in the above variables to disk and make this query take much
# longer to run.
trip_means = compute(combined, destroy=[data, filtered, grouped])
```
write_to_disk
Both Banyan arrays and Banyan data frames can be written to disk on the cluster where the offloaded code runs. While similar functions are frequently used in Dask and Spark for persisting or caching a dataset that will be used throughout a long computation, Banyan doesn't require you to use `write_to_disk` as an optimization. Banyan automatically figures out how to keep data in memory and in CPU cache as much as possible and writes to disk only when needed.
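A hedged sketch of how this might look (assuming `write_to_disk` takes a future, as described above; the path and the filtering step are illustrative only):

```julia
# Hypothetical dataset path, for illustration.
data = read_csv("s3://my-bucket/trips.csv")
filtered = filter(row -> row.trip_distance < 1.0, data)

# Persist the filtered future to disk on the cluster. Unlike Dask's
# persist() or Spark's cache(), this is not needed as an optimization;
# Banyan manages memory and disk automatically.
write_to_disk(filtered)
```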
write_hdf5, write_csv, write_parquet, write_arrow
BanyanArrays.jl and BanyanDataFrames.jl export functions that you can use for writing data to a particular location. See the API references for more information. When you write data with these functions, you pass in a future; the future is then computed and the result is written to the given location.
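For example (a sketch; the exact `write_parquet` signature is an assumption based on the function names above, and the S3 path is hypothetical):

```julia
# Reuse the grouped-aggregation future from the earlier example.
data = read_csv(p)
grouped = groupby(data, :PULocationID)
combined = combine(grouped, :total_amount => mean_func)

# Passing the future to a write function computes it and writes the
# result to the given location.
write_parquet(combined, "s3://my-bucket/trip_means.parquet")
```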
Samples
In Banyan, everything you compute can also be sampled. Samples are automatically maintained, and you can access a sample by calling `sample` on the future. For example, you can read in a data frame with `s = read_csv("s3://sales/2020/09")` and collect the sampled `DataFrame` with `sample(s)`. Find out more about sampling your data.