Benchmarks

Benchmarking Banyan against other libraries and services for large-scale data science is important to us, because we believe that the benefits of automatic instant data sampling (reduced cloud costs and ecological footprint) should not come at the cost of performance.

Analyzing Billions of Rows (preliminary)

We ran a preliminary benchmark of a groupby aggregation on a large dataset: tens of gigabytes of compressed Parquet data requiring terabytes of memory.

The code that was run with Dask (Python):

import dask.dataframe as dd

# nyc_taxi_path points to the Parquet files for the NYC taxi dataset.
df = dd.read_parquet(nyc_taxi_path, ignore_metadata_file=True)
short_trips = df[df.trip_distance < 1.0]
gdf = short_trips.groupby("PULocationID")
agg_cols = ["trip_distance", "total_amount", "tip_amount"]
mean_fn = {c: "mean" for c in agg_cols}
trip_means = gdf.agg(mean_fn).compute()

The equivalent Banyan (Julia) code:

using BanyanDataFrames
using Statistics: mean

# Read, keep trips under one mile, group by pickup location, and average.
trip_means = combine(
    groupby(
        filter(
            row -> row.trip_distance < 1.0,
            BanyanDataFrames.read_parquet(nyc_taxi_path)
        ),
        :PULocationID
    ),
    :total_amount => mean,
    :tip_amount => mean,
    :trip_distance => mean
) |> compute
| # of columns aggregated | Dataset size    | Dask (Coiled) | Banyan |
|-------------------------|-----------------|---------------|--------|
| 1                       | 13 GB (1B rows) | --            | 89s    |
| 2                       | 13 GB (1B rows) | 81s           | 93s    |
| 3                       | 26 GB (2B rows) | 155s          | 180s   |

H2O.ai Database-Like Ops (coming soon)

We are working on implementing the H2O.ai database-like ops benchmark for a more complete evaluation of performance and comparison with other libraries for large-scale data science.
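
As a rough illustration of what those ops involve, here is what we understand to be the first groupby question in that benchmark (sum of v1 grouped by id1), sketched in plain DataFrames.jl on made-up sample data; this is only a sketch, not Banyan's benchmark code:

using DataFrames

# Made-up sample data standing in for the benchmark's generated table.
df = DataFrame(
    id1 = rand(["id" * string(i) for i in 1:10], 1_000),
    v1  = rand(1:5, 1_000),
)

# H2O.ai groupby question 1: sum of v1 grouped by id1.
q1 = combine(groupby(df, :id1), :v1 => sum => :v1_sum)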