Benchmarks

Benchmarking Banyan against other libraries and services for large-scale data science is important to us, because we believe that the benefits of automatic instant data sampling (reduced cloud costs and ecological footprint) should not come at the cost of performance.

Analyzing Billions of Rows (preliminary)

We ran a preliminary benchmark of a groupby aggregation on a large dataset: tens of gigabytes of compressed Parquet data requiring terabytes of memory.

The code that was run with Dask (Python):

import dask.dataframe as dd

# nyc_taxi_path points to the Parquet files for the NYC taxi dataset.
df = dd.read_parquet(nyc_taxi_path, ignore_metadata_file=True)
short_trips = df[df.trip_distance < 1.0]
gdf = short_trips.groupby("PULocationID")
agg_cols = ["trip_distance", "total_amount", "tip_amount"]
mean_fn = {c: "mean" for c in agg_cols}
trip_means = gdf.agg(mean_fn).compute()

The equivalent Banyan (Julia) code:

using BanyanDataFrames
using Statistics: mean

# Read, keep trips under one mile, group by pickup location, and average.
trip_means = combine(
    groupby(
        filter(
            row -> row.trip_distance < 1.0,
            BanyanDataFrames.read_parquet(nyc_taxi_path)
        ),
        :PULocationID
    ),
    :total_amount => mean,
    :tip_amount => mean,
    :trip_distance => mean
) |> compute
| # of columns aggregated | Dataset size    | Dask (Coiled) | Banyan |
|-------------------------|-----------------|---------------|--------|
| 1                       | 13 GB (1B rows) | --            | 89s    |
| 2                       | 13 GB (1B rows) | 81s           | 93s    |
| 3                       | 26 GB (2B rows) | 155s          | 180s   |

H2O.ai Database-Like Ops (coming soon)

We are working on implementing the H2O.ai database-like ops benchmark for a more complete evaluation of performance and comparison with other libraries for large-scale data science.
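
As a rough illustration of what those ops involve, here is what we understand to be the first groupby question in that benchmark (sum of v1 grouped by id1), sketched in plain DataFrames.jl on made-up sample data; this is only a sketch, not Banyan's benchmark code:

using DataFrames

# Made-up sample data standing in for the benchmark's generated table.
df = DataFrame(
    id1 = rand(["id" * string(i) for i in 1:10], 1_000),
    v1  = rand(1:5, 1_000),
)

# H2O.ai groupby question 1: sum of v1 grouped by id1.
q1 = combine(groupby(df, :id1), :v1 => sum => :v1_sum)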