BanyanDataFrames.jl
BanyanDataFrames.jl is a scalable data analytics toolkit for developers, data engineers, and data scientists who are familiar with the Julia programming language.
Getting Started
To get started with BanyanDataFrames.jl, follow the steps here to set up Banyan.jl. Then, open the Julia REPL, press `]` to enter Pkg (package) mode, and run `add BanyanDataFrames`. (Ensure you have also added `Banyan` and `BanyanArrays` first.) Finally, exit package mode and start a session.
```julia
using Banyan, BanyanArrays, BanyanDataFrames

start_session(
    cluster_name="julia-sales-etl",
    nworkers=128,
    session_name="Sales-Data-Normalized",
    email_when_ready=true
)
```
Awesome! You can now use the functions described below for massively parallel data processing in this session with 128 workers.
DataFrame and GroupedDataFrame
BanyanDataFrames.jl provides a `DataFrame` and a `GroupedDataFrame`. Both are futures (subtypes of `AbstractFuture`), so functions that apply to futures can be applied to both `DataFrame`s and `GroupedDataFrame`s.
While a `DataFrame` can be `compute`d, a `GroupedDataFrame` is a view of a data frame, so it cannot be directly `compute`d. Instead, a `GroupedDataFrame` must first be `transform`/`select`/`combine`/`subset`-ed into a `DataFrame`.
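For example, here is a minimal sketch (the dataset path and the `:category` and `:amount` columns are hypothetical):

```julia
using Banyan, BanyanArrays, BanyanDataFrames

# A DataFrame is a future, so it can be compute-d directly
df = read_csv("s3://my-bucket/sales.csv")  # hypothetical path
local_df = compute(df)

# A GroupedDataFrame is a view, so it must first be turned back into a
# DataFrame (here with combine) before its result can be collected
gdf = groupby(df, :category)
totals = combine(gdf, :amount => sum)
local_totals = compute(totals)
```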
Reading and Writing Data Frames
Data frames can be read from and written to CSV, Parquet, and Arrow datasets with `read_csv`, `read_parquet`, `read_arrow`, `write_csv`, `write_parquet`, and `write_arrow`.
Note for using Parquet or CSV
To read/write CSV or Parquet, you must run `using CSV` or `using Parquet` before, or on the same line as, `using BanyanDataFrames`.
Also, since Parquet.jl also exports `read_parquet`/`write_parquet`, you must qualify these functions as `BanyanDataFrames.read_parquet`/`BanyanDataFrames.write_parquet`.
Each of these functions accepts a string path describing the location being read from or written to. This path must meet the following criteria:
- Path being read from
  - Path to a file in S3
    - Must begin with `s3://`
    - Must end with either `.csv`, `.parquet`, or `.arrow`
  - Path to a directory in S3
    - Must begin with `s3://`
    - Must contain files that end with either `.csv`, `.parquet`, or `.arrow`
  - Path to a file on the Internet
    - Must begin with `https://` or `http://`
- Path being written to
  - Must be a directory in S3 (not necessarily created yet)
  - Must begin with `s3://`
  - Must end with either `.csv`, `.parquet`, or `.arrow`
A path that complies with the above can be passed to a reading function or to a writing function:
read_csv("https://www.example.com/archive/17/7/1/logs.csv")
write_arrow(
likelihood_by_category,
"s3://experimentation-cluster-bucket/likelihood_findings.arrow"
)
When reading a data frame, a sample must be collected. Find out how to collect a sample faster and how to preserve cached samples after writing.
Filtering
Data frames can be filtered with `filter`.
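For example (a sketch with a hypothetical dataset and `:price` column):

```julia
df = read_csv("s3://my-bucket/products.csv")

# Keep only rows where the price exceeds 100; like other
# BanyanDataFrames.jl operations, this returns a future DataFrame
expensive = filter(row -> row[:price] > 100, df)
```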
Joins, Sorting, Column Manipulation
We have implemented `innerjoin`, `sort`, `getindex`, `setindex!`, `rename`, `copy`, and `deepcopy`, but we are currently testing these to ensure robustness and performance. Please send us an email at support@banyancomputing.com, contact us on the Banyan Users Slack, or create a GitHub issue so that we can prioritize the testing and quality assurance of these features to meet your needs as soon as possible.
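If you'd like to experiment with these functions, they follow the DataFrames.jl API; here is a sketch with hypothetical datasets and key columns:

```julia
orders = read_csv("s3://my-bucket/orders.csv")
customers = read_csv("s3://my-bucket/customers.csv")

# Join orders with their customers on a shared key column, then sort
joined = innerjoin(orders, customers, on=:customer_id)
sorted = sort(joined, :total)
```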
Grouping
You can group a `DataFrame` by one or more columns with `groupby` to produce a `GroupedDataFrame`. A `GroupedDataFrame` can then be `transform`/`select`/`combine`/`subset`-ed into a `DataFrame`.
Use `combine` for group-by aggregation and `subset` for filtering groups (similar to `HAVING` in SQL). You can also aggregate entire columns without grouping.
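A sketch of group-by aggregation and group filtering (the dataset and columns are hypothetical; the `subset` idiom shown follows DataFrames.jl):

```julia
visits = read_csv("s3://my-bucket/visits.csv")
gdf = groupby(visits, :region)

# Group-by aggregation: total spending per region
totals = combine(gdf, :spending => sum)

# Filter out entire groups (like HAVING in SQL): keep only regions
# whose total spending exceeds 1000
high_regions = subset(gdf, :spending => s -> fill(sum(s) > 1000, length(s)))
```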
Columns
Columns are Arrays
Columns are simply Banyan arrays. Once you have accessed a column, you can use the standard-library array manipulation functions supported by BanyanArrays.jl, such as `map` and `reduce`. See the API reference for BanyanArrays.jl for more details on how to manipulate columns of a Banyan data frame.
Getting Columns
Columns can be accessed with `getindex(df, :, cols)` (or the equivalent shorthand `df[:, cols]`).
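For example (hypothetical `:price` column):

```julia
prices = df[:, :price]  # a Banyan array

# Since columns are Banyan arrays, BanyanArrays.jl functions like map apply
discounted = map(p -> 0.9 * p, prices)
```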
Aggregating Columns
Columns can be aggregated with `reduce` or other functions in BanyanArrays.jl such as `sum`, `minimum`, and `maximum`. You can perform group-by aggregations with `groupby` and `combine`.
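A sketch of a few column aggregations (hypothetical `:spending` and `:day` columns):

```julia
# Reduce a column directly
total = reduce(+, df[:, :spending])

# Or use convenience reductions from BanyanArrays.jl
lo, hi = minimum(df[:, :spending]), maximum(df[:, :spending])

# Or aggregate per group with groupby and combine
per_day = combine(groupby(df, :day), :spending => sum)
```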
Properties and Futures
Functions that compute properties of data frames (`nrow`, `ncol`, `size`, `ndims`, `names`, and `propertynames`) and grouped data frames (`length`, `size`, `ndims`, `groupcols`, and `valuecols`) immediately return computed results, while all other functions in BanyanDataFrames.jl return futures (data frames and grouped data frames are futures).
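For example (hypothetical data frame `df`; `compute` is described below):

```julia
n = nrow(df)      # an integer, computed immediately
cols = names(df)  # a vector of column names, computed immediately

# By contrast, filter returns a future that must be compute-d
filtered = filter(row -> row[:day] == 18, df)
result = compute(filtered)
```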
Futures must be `compute`d to actually run the computation and collect the result onto your local "client" machine. You must compute a future data frame before using it with functions that BanyanDataFrames.jl does not yet support. For example, if you want to visualize a data frame produced by BanyanDataFrames.jl, make sure to `compute` it first:
```julia
using Banyan, BanyanDataFrames, Parquet
using Plots, StatsPlots

with_session(
    cluster_name="Customer Data Analytics",
    session_name="Hourly Customer Spending for Today",
    nworkers=128,
) do s
    # Get the most recent visits from a Parquet dataset stored in an Amazon S3
    # bucket
    visits = BanyanDataFrames.read_parquet("s3://customer_data/visits.parquet")
    most_recent_visits = filter(visit -> visit[:day] == 18, visits)

    # Compute the result and collect it to your local "client" machine (e.g., a
    # laptop)
    V = compute(most_recent_visits)

    # Now we can use Plots and StatsPlots to plot the data!
    @df V scatter(
        :hour,
        :spending
    )
end
```
A Word From Our Team of Banyaneers
Need a function that isn't listed here? Not sure how to implement your use case? Please send us an email at support@banyancomputing.com, contact us on the Banyan Users Slack, or create a GitHub issue so that we can meet your needs as soon as possible.