BanyanDataFrames.jl
BanyanDataFrames.jl is a scalable data analytics toolkit for developers, data engineers, and data scientists who are familiar with the Julia programming language.
Getting Started
To get started with BanyanDataFrames.jl, follow the steps here to set up Banyan.jl. Then, open the Julia REPL, press `]` to enter Pkg (package) mode, and run `add BanyanDataFrames`. (Ensure you have also added `Banyan` and `BanyanArrays` first.) Finally, exit package mode and start a session.
```julia
using Banyan, BanyanArrays, BanyanDataFrames

start_session(
    cluster_name="julia-sales-etl",
    nworkers=128,
    session_name="Sales-Data-Normalized",
    email_when_ready=true
)
```
Awesome! You can now use the functions described below for massively parallel data processing in this session with 128 workers.
DataFrame and GroupedDataFrame
BanyanDataFrames.jl provides a `DataFrame` and a `GroupedDataFrame`. Both are futures (subtypes of `AbstractFuture`), so functions that apply to futures can be applied to both `DataFrame`s and `GroupedDataFrame`s.
While a `DataFrame` can be `compute`d, a `GroupedDataFrame` is a view of a data frame, so it cannot be directly `compute`d. Instead, a `GroupedDataFrame` must first be `transform`/`select`/`combine`/`subset`-ed into a `DataFrame`.
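For example, here is a minimal sketch (the dataset path and the `:category` and `:amount` columns are hypothetical):

```julia
using Banyan, BanyanArrays, BanyanDataFrames

# A DataFrame is a future, so it can be compute-d directly
df = read_csv("s3://my-bucket/sales.csv")  # hypothetical path
local_df = compute(df)

# A GroupedDataFrame is a view, so it must first be turned back into a
# DataFrame (here with combine) before its result can be collected
gdf = groupby(df, :category)
totals = combine(gdf, :amount => sum)
local_totals = compute(totals)
```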
Reading and Writing Data Frames
Data frames can be read from and written to CSV, Parquet, and Arrow datasets with `read_csv`, `read_parquet`, `read_arrow`, `write_csv`, `write_parquet`, and `write_arrow`.
Note for using Parquet or CSV
To read/write CSV or Parquet, you must run `using CSV` or `using Parquet` before, or on the same line as, `using BanyanDataFrames`.
Also, since Parquet.jl also exports `read_parquet`/`write_parquet`, you must qualify these functions as `BanyanDataFrames.read_parquet`/`BanyanDataFrames.write_parquet`.
Each of these functions accepts a string path describing the location being read from or written to. This path must meet the following criteria:
- Path being read from
  - Path to a file in S3
    - Must begin with `s3://`
    - Must end with either `.csv`, `.parquet`, or `.arrow`
  - Path to a directory in S3
    - Must begin with `s3://`
    - Must contain files that end with either `.csv`, `.parquet`, or `.arrow`
  - Path to a file on the Internet
    - Must begin with `https://` or `http://`
- Path being written to
  - Must be a directory in S3 (not necessarily created yet)
  - Must begin with `s3://`
  - Must end with either `.csv`, `.parquet`, or `.arrow`
A path that complies with the above can be passed to a reading function or to a writing function:
read_csv("https://www.example.com/archive/17/7/1/logs.csv")
write_arrow(
likelihood_by_category,
"s3://experimentation-cluster-bucket/likelihood_findings.arrow"
)
When reading a data frame, a sample must be collected. Find out how to collect a sample faster and how to preserve cached samples after writing.
Filtering
Data frames can be filtered with `filter`.
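For example (a sketch with a hypothetical dataset and `:price` column):

```julia
df = read_csv("s3://my-bucket/products.csv")

# Keep only rows where the price exceeds 100; like other
# BanyanDataFrames.jl operations, this returns a future DataFrame
expensive = filter(row -> row[:price] > 100, df)
```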
Joins, Sorting, Column Manipulation
We have implemented `innerjoin`, `sort`, `getindex`, `setindex!`, `rename`, `copy`, and `deepcopy`, but we are currently testing these to ensure robustness and performance. Please send us an email at support@banyancomputing.com, contact us on the Banyan Users Slack, or create a GitHub issue so that we can prioritize the testing and quality assurance of these features to meet your needs as soon as possible.
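If you'd like to experiment with these functions, they follow the DataFrames.jl API; here is a sketch with hypothetical datasets and key columns:

```julia
orders = read_csv("s3://my-bucket/orders.csv")
customers = read_csv("s3://my-bucket/customers.csv")

# Join orders with their customers on a shared key column, then sort
joined = innerjoin(orders, customers, on=:customer_id)
sorted = sort(joined, :total)
```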
Grouping
You can group a `DataFrame` by one or more columns with `groupby` to produce a `GroupedDataFrame`. A `GroupedDataFrame` can then be `transform`/`select`/`combine`/`subset`-ed into a `DataFrame`.
Use `combine` for group-by aggregation and `subset` for filtering groups (similar to `HAVING` in SQL). You can also aggregate entire columns without grouping.
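A sketch of group-by aggregation and group filtering (the dataset and columns are hypothetical; the `subset` idiom shown follows DataFrames.jl):

```julia
visits = read_csv("s3://my-bucket/visits.csv")
gdf = groupby(visits, :region)

# Group-by aggregation: total spending per region
totals = combine(gdf, :spending => sum)

# Filter out entire groups (like HAVING in SQL): keep only regions
# whose total spending exceeds 1000
high_regions = subset(gdf, :spending => s -> fill(sum(s) > 1000, length(s)))
```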
Columns
Columns are Arrays
Columns are simply Banyan arrays. Once you have accessed a column, you can use the standard-library array manipulation functions supported by BanyanArrays.jl, such as `map` and `reduce`. See the API reference for BanyanArrays.jl for more details on how to manipulate columns of a Banyan data frame.
Getting Columns
Columns can be accessed with `getindex(df, :, cols)` (or the equivalent shorthand `df[:, cols]`).
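For example (hypothetical `:price` column):

```julia
prices = df[:, :price]  # a Banyan array

# Since columns are Banyan arrays, BanyanArrays.jl functions like map apply
discounted = map(p -> 0.9 * p, prices)
```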
Aggregating Columns
Columns can be aggregated with `reduce` or other functions in BanyanArrays.jl such as `sum`, `minimum`, and `maximum`. You can perform group-by aggregations with `groupby` and `combine`.
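A sketch of a few column aggregations (hypothetical `:spending` and `:day` columns):

```julia
# Reduce a column directly
total = reduce(+, df[:, :spending])

# Or use convenience reductions from BanyanArrays.jl
lo, hi = minimum(df[:, :spending]), maximum(df[:, :spending])

# Or aggregate per group with groupby and combine
per_day = combine(groupby(df, :day), :spending => sum)
```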
Properties and Futures
Functions that compute properties of data frames (`nrow`, `ncol`, `size`, `ndims`, `names`, and `propertynames`) and grouped data frames (`length`, `size`, `ndims`, `groupcols`, and `valuecols`) immediately return computed results, while all other functions in BanyanDataFrames.jl return futures (data frames and grouped data frames are futures).
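For example (hypothetical data frame `df`; `compute` is described below):

```julia
n = nrow(df)      # an integer, computed immediately
cols = names(df)  # a vector of column names, computed immediately

# By contrast, filter returns a future that must be compute-d
filtered = filter(row -> row[:day] == 18, df)
result = compute(filtered)
```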
Futures must be `compute`d to actually run the computation and collect the result onto your local "client" machine. You must compute a future data frame before using it with functions that BanyanDataFrames.jl does not yet support. For example, if you want to visualize a data frame produced by BanyanDataFrames.jl, make sure to `compute` it first:
```julia
using Banyan, BanyanDataFrames, Parquet
using Plots, StatsPlots

with_session(
    cluster_name="Customer Data Analytics",
    session_name="Hourly Customer Spending for Today",
    nworkers=128,
) do s
    # Get the most recent visits from a Parquet dataset stored in an Amazon S3
    # bucket
    visits = BanyanDataFrames.read_parquet("s3://customer_data/visits.parquet")
    most_recent_visits = filter(visit -> visit[:day] == 18, visits)

    # Compute the result and collect it to your local "client" machine (e.g., a
    # laptop)
    V = compute(most_recent_visits)

    # Now we can use Plots and StatsPlots to plot the data!
    @df V scatter(
        :hour,
        :spending
    )
end
```
A Word From Our Team of Banyaneers
Need a function that isn't listed here? Not sure how to implement your use case? Please send us an email at support@banyancomputing.com, contact us on the Banyan Users Slack, or create a GitHub issue so that we can meet your needs as soon as possible.