Data Sampling

Julia-only

This feature is currently only supported in Julia. Python support coming soon.

Data sampling is the key to Banyan's vision for more sustainable data science. That vision can be summed up in a single sentence: instead of running every data science query in a cloud data center, why not make it possible to instantly switch between working with all your data in the cloud and working with samples of your data on your laptop? But first, we need to explain what a data sample is.

What is a Data Sample?

A data sample is a randomly selected subset of a larger dataset.

Accessing Samples

In Banyan, everything you compute can also be sampled. Samples are maintained automatically, and you can access one by calling sample on a future. For example, you can read in a data frame with s = read_csv("s3://sales/2020/09") and collect the sampled DataFrame with sample(s).
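As a minimal sketch (assuming the Banyan and BanyanDataFrames packages and an already-running session; the bucket path is illustrative), a typical interaction might look like this:

    using Banyan, BanyanDataFrames

    # Reading returns a future: a lazy handle to the full dataset in the cloud.
    s = read_csv("s3://sales/2020/09")  # illustrative path

    # sample(s) collects the automatically maintained sample as a local DataFrame.
    df = sample(s)
    first(df, 5)  # inspect a few sampled rows on your laptop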

Sampling Use-Cases

There are two broad use-cases of samples in Banyan:

  1. "Sample first, compute later" for quickly accessing approximate results
  2. "Compute once, sample often" for avoiding running computation in cloud data centers

We have put together a Jupyter notebook that you can look through and even try out to see how both of these use-cases work.
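As a rough sketch of the first use-case ("sample first, compute later"), you might prototype a query on the sample before running it on the full dataset. The path and the column name amount are hypothetical, and we assume data-frame operations like filter can be applied to futures:

    # Iterate on the sample locally first...
    s = read_csv("s3://sales/2020/09")        # illustrative path
    preview = sample(s)                       # small local DataFrame
    # ...explore `preview` to refine the query, then run it on all the data:
    big_sales = filter(row -> row.amount > 100, s)  # hypothetical column `amount`
    write_csv(big_sales, "s3://sales/2020/09-filtered")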

Shuffled and Cached Samples

Samples and metadata are cached (in the Amazon S3 bucket for the cluster of the currently running session) so that they can be reused. Read this section to learn when and how you may want to change the default caching behavior.

All of the parameters described in this section are keyword arguments that can be passed into read_hdf5, write_hdf5, read_csv, write_csv, read_parquet, write_parquet, read_arrow, write_arrow, etc.

Faster Sample Collection

Sometimes data is already shuffled or evenly distributed, so it isn't necessary to read through the entire dataset to collect a representative sample. We assume shuffled=true by default, but if this assumption doesn't hold (i.e., the data distribution varies wildly from file to file), simply pass shuffled=false into a read_* function.
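For example (illustrative path):

    # Data distribution varies a lot from file to file, so read through
    # the whole dataset when collecting the sample:
    s = read_csv("s3://sales/2020/09"; shuffled=false)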

Invalidate Cache After External Modification

If you are reading from a location that has been modified by some other user or non-Banyan actor, you may want to specify that the cached metadata or sample is invalid. Simply pass metadata_invalid=true into a read_* function to indicate that the files, or the contents of files, in the source location have changed. Pass sample_invalid=true to indicate that the schema or distribution of the data has changed so much that a new sample is required.
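For example, both flags are keyword arguments (illustrative path):

    # Files were added or removed outside of Banyan; refresh the metadata:
    s = read_csv("s3://sales/2020/09"; metadata_invalid=true)

    # The schema or distribution changed substantially; collect a fresh sample:
    s = read_csv("s3://sales/2020/09"; sample_invalid=true)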

Preserve Cache When Writing

When using a write_* function, the default behavior is to immediately invalidate the cached metadata and samples for that location. However, sometimes the data has not changed enough (or at all) to warrant invalidation. In that case, pass invalidate_metadata=false and/or invalidate_sample=false into the writing function.
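For example (the data frame res and the path are illustrative):

    # The write changes the data only slightly; keep the cached
    # metadata and sample instead of invalidating them:
    write_csv(res, "s3://sales/2020/09"; invalidate_metadata=false, invalidate_sample=false)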

Invalidating Samples and Metadata

You can also invalidate the sample or metadata for a given path with invalidate_sample or invalidate_metadata. To invalidate the samples and metadata for all locations read from or written to by the currently active cluster, use invalidate_all_locations.
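For example (illustrative path; we assume invalidate_all_locations takes no arguments):

    # Invalidate cached state for one path:
    invalidate_metadata("s3://sales/2020/09")
    invalidate_sample("s3://sales/2020/09")

    # Invalidate samples and metadata for every location used by this cluster:
    invalidate_all_locations()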