Using Amazon S3

Banyan lets you process large datasets that are either hosted on the Internet or stored in the cloud in your AWS account (the account you set up when you got started). Connect your cluster to an S3 bucket in that account, and sessions started on that cluster can read and write the data stored there.

Connecting Banyan Clusters to S3 Buckets

When creating a cluster, you can specify an Amazon S3 bucket in your AWS account to associate with the cluster (a bucket will be created in your account if no existing bucket is specified). You can find the name of a cluster's S3 bucket either on the Banyan dashboard or with the get_cluster_s3_bucket_name(cluster_name) function.

An S3 bucket is like a folder that you can put files in. See below for uploading data to an S3 bucket.

Uploading Data to an S3 Bucket

There are several methods for uploading data to your cluster's S3 bucket:

  1. Banyan API
  2. aws s3 CLI
  3. S3FS Drag-and-Drop Method

Using Banyan.jl

The Banyan.jl package provides a function, upload_to_s3, for uploading files that are on the Internet, in S3, or on your local file system to your cluster's S3 bucket. Provide the path to a file or directory, and it will be uploaded to the S3 bucket associated with your cluster. See the full upload_to_s3 documentation for more details.

Using the aws s3 CLI

You can also use the AWS CLI to upload data to your cluster's S3 bucket. First, ensure you have the AWS CLI installed. Then run get_cluster_s3_bucket_name(cluster_name) in the Julia REPL to retrieve the name of your cluster's S3 bucket. Finally, run the following command in a terminal or command prompt, keeping the s3:// prefix in front of the bucket name:

aws s3 cp <path/to/filename> s3://<s3_bucket_name>/
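The upload step can be wrapped in a small helper so that single files and whole directories are handled the same way. This is only a sketch: the function names s3_uri and upload_to_bucket are hypothetical and not part of Banyan or the AWS CLI, and it assumes the AWS CLI is installed and configured with credentials for your account.

```shell
# Hypothetical helper: build the s3:// destination URI that `aws s3 cp` expects.
# Accepts a bare bucket name or a value that already starts with s3://.
s3_uri() {
  local bucket="${1#s3://}"   # drop an existing s3:// prefix, if any
  bucket="${bucket%/}"        # drop one trailing slash, if any
  local key="${2:-}"          # optional key prefix inside the bucket
  printf 's3://%s/%s\n' "$bucket" "$key"
}

# Hypothetical helper: upload a file or a directory to the bucket.
# Directories need --recursive and are placed under a prefix named after them.
upload_to_bucket() {
  local src="$1" bucket="$2"
  if [ -d "$src" ]; then
    aws s3 cp "$src" "$(s3_uri "$bucket" "$(basename "$src")/")" --recursive
  else
    aws s3 cp "$src" "$(s3_uri "$bucket")"
  fi
}
```

For example, upload_to_bucket ./data.csv <s3_bucket_name> copies one file to the top level of the bucket. Adding --dryrun to the aws s3 cp calls previews the transfer without copying anything.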

Drag-and-Drop with S3FS

If you have S3FS installed, Banyan will attempt to use it to access data stored in Amazon S3 more efficiently by mounting the cluster's S3 bucket in your local file system. You can then drag and drop your data files into the mounted bucket to upload them. To prevent Banyan from using S3FS, set the environment variable BANYAN_USE_S3FS=0.

Note that while setting up and using S3FS is completely optional, it may provide a performance improvement if you are using large datasets.
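If you decide to opt out of S3FS, one way to set the variable is to export it in the shell before starting Julia. This is a minimal sketch that assumes Banyan reads BANYAN_USE_S3FS from the environment of the Julia process; the documentation above only states the variable name and value.

```shell
# Disable S3FS for any Julia session launched from this shell.
# Assumption: Banyan reads the variable from the process environment.
export BANYAN_USE_S3FS=0

# Confirm the setting before starting Julia.
echo "BANYAN_USE_S3FS=${BANYAN_USE_S3FS}"
```

To make the setting persistent, the export line can be added to your shell profile (for example ~/.bashrc).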

S3FS lets you mount Amazon S3 buckets as local directories. It requires a one-time installation and setup when you get started. On Linux, run the following to install and set up S3FS, replacing ACCESS_KEY_ID and SECRET_ACCESS_KEY with your AWS credentials. See here for the full documentation.

sudo apt install s3fs
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ${HOME}/.passwd-s3fs
chmod 600 ${HOME}/.passwd-s3fs
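Since s3fs refuses a credentials file that other users can read, it can help to verify the file before mounting. The check below is a sketch only: check_passwd_file is a hypothetical helper name, and it relies on GNU stat as shipped with most Linux distributions.

```shell
# Hypothetical helper: succeed only when the s3fs credentials file exists
# and has mode 600 (readable and writable by its owner only).
check_passwd_file() {
  local file="${1:-$HOME/.passwd-s3fs}"
  if [ ! -f "$file" ]; then
    echo "missing credentials file: $file" >&2
    return 1
  fi
  local mode
  mode="$(stat -c '%a' "$file")"   # GNU coreutils stat (Linux)
  if [ "$mode" != "600" ]; then
    echo "mode is $mode, expected 600: $file" >&2
    return 1
  fi
}
```

Calling check_passwd_file with no argument checks ${HOME}/.passwd-s3fs; a non-zero exit status means the file is missing or its permissions should be tightened with chmod 600.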

See here for how to set up S3FS for Windows.

Once you have installed S3FS, Banyan can mount your cluster's bucket on your local machine. Banyan does this automatically when you create data with a remote location (i.e., S3). If this fails on Linux, use the following to mount your cluster's S3 bucket on your local machine.

Set bucket to the name of your cluster's bucket and region to your AWS region. You can use get_cluster_s3_bucket_name(cluster_name) to retrieve the bucket name.

s3fs $bucket $HOME/.banyan/mnt/s3/$bucket -o url=https://s3.$ -o endpoint=$region -o passwd_file=$HOME/.passwd-s3fs
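s3fs expects the mount point directory to exist before it mounts a bucket there. The helpers below sketch how the directory could be prepared and the mount checked afterwards. The function names are hypothetical; the path layout is taken from the command above, and mountpoint is the util-linux tool available on most Linux systems.

```shell
# Hypothetical helper: print the local path used for a bucket's mount point,
# following the $HOME/.banyan/mnt/s3/$bucket layout from the s3fs command.
banyan_mount_point() {
  printf '%s/.banyan/mnt/s3/%s\n' "$HOME" "$1"
}

# Hypothetical helper: create the mount point if needed and print its path.
prepare_mount_point() {
  local dir
  dir="$(banyan_mount_point "$1")"
  mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Hypothetical helper: succeed only when a file system is mounted there.
is_bucket_mounted() {
  mountpoint -q "$(banyan_mount_point "$1")"
}
```

A typical sequence would be to run prepare_mount_point "$bucket", then the s3fs command, then is_bucket_mounted "$bucket" to confirm the mount. To unmount later, fusermount -u "$(banyan_mount_point "$bucket")" is the usual command for FUSE file systems on Linux.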

Please refer to the documentation for WinS3FS for how to mount an S3 bucket on Windows.

Once you have mounted your cluster's S3 bucket with S3FS, navigate to the path returned by get_s3fs_bucket_path(cluster_name) and drag and drop files or folders into the location.

Reading/Writing Data From/To S3

Once you have data in your cluster's S3 bucket, any session started on that cluster can read from and write to files in the bucket. See the documentation for read_hdf5, write_hdf5, read_csv, write_csv, read_parquet, write_parquet, read_arrow, and write_arrow in BanyanArrays.jl and BanyanDataFrames.jl for details on reading and writing data.