Benchmarking Banyan against other libraries and services for large-scale data science is very important to us, because we believe that the benefits of automatic instant data sampling (reduced cloud costs and ecological footprint) should not come at the cost of performance.

Analyzing Billions of Rows (preliminary)

We ran a preliminary benchmark of a groupby aggregation on a large dataset: tens of GBs of compressed Parquet data that requires TBs of memory to process.

The code that was run, first with Dask (Python):

import dask.dataframe as dd

df = dd.read_parquet(nyc_taxi_path, ignore_metadata_file=True)
long_trips = df[df.trip_distance < 1.0]
gdf = long_trips.groupby("PULocationID")
agg_cols = ["trip_distance", "total_amount", "tip_amount"]
mean_fn = {c: "mean" for c in agg_cols}
trip_means = gdf.agg(mean_fn).compute()

And the equivalent code with Banyan (Julia):

df = read_parquet(nyc_taxi_path)
long_trips = filter(row -> row.trip_distance < 1.0, df)
gdf = groupby(long_trips, :PULocationID)
trip_means = combine(
    gdf,
    :trip_distance => mean,
    :total_amount => mean,
    :tip_amount => mean
) |> compute
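For readers who want to check the logic of the query without a cluster, the same filter-groupby-mean pipeline can be sketched on a tiny in-memory pandas frame. The data below is synthetic, not the NYC taxi dataset:

```python
import pandas as pd

# Synthetic stand-in for a handful of taxi trips (not the real dataset).
df = pd.DataFrame({
    "PULocationID": [1, 1, 2, 2, 2],
    "trip_distance": [0.5, 0.8, 0.3, 1.5, 0.9],
    "total_amount": [5.0, 7.0, 4.0, 12.0, 6.0],
    "tip_amount": [1.0, 2.0, 0.5, 3.0, 1.5],
})

# Same predicate as the benchmark query above.
filtered = df[df.trip_distance < 1.0]

# Per-pickup-location means of the three aggregated columns.
agg_cols = ["trip_distance", "total_amount", "tip_amount"]
means = filtered.groupby("PULocationID").agg({c: "mean" for c in agg_cols})
print(means)
```

The Dask version runs this exact pattern out-of-core across partitions; only the `.compute()` call at the end materializes the (small) aggregated result.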
# of columns aggregated   Dataset size      Dask (Coiled)   Banyan
1                         13 GB (1B rows)   --              89s
2                         13 GB (1B rows)   81s             93s
3                         26 GB (2B rows)   155s            180s

Database-Like Ops (coming soon)

We are working on implementing the database-like ops benchmark for a more complete evaluation of performance and comparison against other libraries for large-scale data science.
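As a taste of what that suite measures, its groupby questions follow patterns like "sum one value column by one id column". A minimal pandas sketch of that pattern on synthetic data (the `id1`/`v1` column names follow the suite's naming convention; this is not its actual harness):

```python
import pandas as pd

# Synthetic stand-in shaped like a database-like-ops groupby input.
x = pd.DataFrame({
    "id1": ["id001", "id002", "id001", "id003"],
    "v1": [1, 2, 3, 4],
})

# The simplest groupby question in such suites: sum v1 by id1.
ans = x.groupby("id1", as_index=False).agg({"v1": "sum"})
print(ans)
```

The suite scales this shape of query (and joins) up to datasets of 1B+ rows, which is where the differences between libraries show up.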