Using Amazon S3
Banyan lets you process large datasets that are either hosted on the Internet or stored in the cloud in your AWS account (the AWS account you set up when you got started). Connect your cluster to an S3 bucket in your AWS account, and then any session started on that cluster can read and write the data stored there.
Connecting Banyan Clusters to S3 Buckets
When creating a cluster, you can specify an Amazon S3
bucket in your AWS account to associate with the cluster (a bucket will be created in your
account if no existing bucket is specified).
You can find the name
of a cluster's S3 bucket either on
the Banyan dashboard or with the get_cluster_s3_bucket_name(cluster_name)
function.
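For example, assuming a cluster named "my-cluster" (a hypothetical name; substitute your own), you can look up the bucket name from the Julia REPL:

```julia
using Banyan

# "my-cluster" is a hypothetical cluster name; substitute your own.
bucket_name = get_cluster_s3_bucket_name("my-cluster")
println(bucket_name)
```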
An S3 bucket is like a folder that you can put files in. See below for uploading data to an S3 bucket.
Uploading Data to an S3 Bucket
There are several methods for uploading data to your cluster's S3 bucket:
Using Banyan.jl
The Banyan.jl package provides a function, upload_to_s3, for uploading files that are on the Internet, in S3, or on your local file system to your cluster's S3 bucket. Just provide the path to a file or directory, and it will be uploaded to the S3 bucket associated with your cluster. See the full upload_to_s3 documentation for more details.
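As a sketch (the paths below are hypothetical, and the exact signature of upload_to_s3 should be confirmed against its documentation), usage might look like:

```julia
using Banyan

# Upload a local file to the cluster's S3 bucket (hypothetical path).
upload_to_s3("data/train.csv")

# Upload a file hosted on the Internet (hypothetical URL).
upload_to_s3("https://example.com/datasets/iris.csv")
```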
Using the aws s3
CLI
You can also use the AWS CLI
to upload data to your cluster's S3 bucket.
First, ensure you have the AWS CLI
installed. Then run
get_cluster_s3_bucket_name(cluster_name)
in the Julia REPL to retrieve the name of your cluster's S3 bucket.
Finally, run the following command in a terminal or command prompt:
aws s3 cp <path/to/filename> s3://<s3_bucket_name>/
Drag-and-Drop with S3FS
If you have S3FS installed, Banyan will attempt to use this to more efficiently
access data stored in Amazon S3. Banyan will attempt to mount the cluster's S3
bucket in your local file system. Then, you can drag and drop your data files
into the bucket to upload them. To force Banyan not to use S3FS, set the environment variable BANYAN_USE_S3FS=0.
Note that while setting up and using S3FS is completely optional, it may provide a performance improvement if you are using large datasets.
S3FS allows you to mount AWS S3 buckets. One-time setup to install S3FS is required when you get started. On Linux, you can run the following commands to install and set up S3FS. Replace ACCESS_KEY_ID and SECRET_ACCESS_KEY with your AWS credentials. See here for the full documentation.
sudo apt install s3fs
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ${HOME}/.passwd-s3fs
chmod 600 ${HOME}/.passwd-s3fs
See here for how to set up S3FS for Windows.
Once you have installed S3FS, Banyan can mount your cluster's bucket on your local machine. Banyan does this automatically when you create data with a remote location (i.e., in S3). If automatic mounting fails on Linux, use the following command to mount your cluster's S3 bucket manually.
Provide your AWS region and the name of your cluster's bucket; you can use get_cluster_s3_bucket_name(cluster_name) to retrieve the bucket name.
`s3fs $bucket $HOME/.banyan/mnt/s3/$bucket -o url=https://s3.$region.amazonaws.com -o endpoint=$region -o passwd_file=$HOME/.passwd-s3fs`
Please refer to the documentation for WinS3FS for how to mount an S3 bucket on Windows.
Once you have mounted your cluster's S3 bucket with S3FS,
navigate to the path returned by get_s3fs_bucket_path(cluster_name)
and
drag and drop files or folders into the location.
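Besides drag-and-drop, you can also copy files into the mounted bucket programmatically. A minimal sketch, assuming a hypothetical cluster name and file path:

```julia
using Banyan

# Get the local mount point of the cluster's S3 bucket
# ("my-cluster" is a hypothetical cluster name).
mnt = get_s3fs_bucket_path("my-cluster")

# Copying into the mount uploads the file to S3 via S3FS.
cp("data/train.csv", joinpath(mnt, "train.csv"))
```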
Reading/Writing Data From/To S3
Once you have data in your cluster's S3 bucket, any session started on a
cluster can read from and write to files in that S3 bucket.
See the documentation for read_hdf5, write_hdf5, read_csv, write_csv, read_parquet, write_parquet, read_arrow, and write_arrow in BanyanArrays.jl and BanyanDataFrames.jl for actually reading and writing data.
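For instance, a session might read a CSV file from the cluster's bucket, transform it, and write the result back. This is only a sketch: the bucket name, paths, and column name are hypothetical, and the authoritative API details are in the BanyanDataFrames.jl documentation:

```julia
using Banyan, BanyanDataFrames

# Read a CSV file from the cluster's S3 bucket (hypothetical bucket name).
df = read_csv("s3://banyan-cluster-data-mycluster/train.csv")

# Keep only rows with a positive value in column x (hypothetical column),
# then write the result back to S3.
result = filter(row -> row.x > 0, df)
write_csv(result, "s3://banyan-cluster-data-mycluster/filtered.csv")
```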