Data Sampling
Julia-only
This feature is currently only supported in Julia. Python support coming soon.
Data sampling is the key to Banyan's vision for more sustainable data science. That vision can be summed up in a single sentence: instead of running every data science query in a cloud data center, why not make it possible to instantly switch between working with all of your data in the cloud and working with samples of your data on your laptop? But first, we need to explain what a data sample is.
What is a Data Sample?
A data sample is a randomly selected subset of a larger dataset.
Accessing Samples
In Banyan, everything you compute can also be sampled. Samples are automatically maintained, and you can access the sample by calling sample on the future. For example, you can read in a data frame with s = read_csv("s3://sales/2020/09") and collect the sampled DataFrame with sample(s).
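As a quick sketch of that workflow (assuming a Banyan session is already configured and running; the S3 path is illustrative):

```julia
using Banyan, BanyanDataFrames

# Read a dataset lazily; this returns a future rather than the full data.
s = read_csv("s3://sales/2020/09")

# Collect the automatically maintained random sample locally.
df_sample = sample(s)

# The sample is a small, ordinary DataFrame you can inspect on your laptop.
first(df_sample, 5)
```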
Sampling Use-Cases
There are two broad use-cases of samples in Banyan:
- "Sample first, compute later" for quickly accessing approximate results
- "Compute once, sample often" for avoiding running computation in cloud data centers
We have put together a Jupyter notebook that you can look through and even try out to see how both of these use-cases work.
Shuffled and Cached Samples
Samples and metadata are cached (in the Amazon S3 bucket for the cluster of the currently running session) so that they can be reused. Read this section to learn how to change the default caching behavior when needed.
All of the parameters described in this section are keyword arguments that can be passed into read_hdf5, write_hdf5, read_csv, write_csv, read_parquet, write_parquet, read_arrow, write_arrow, etc.
Faster Sample Collection
Sometimes data is already shuffled or distributed evenly, so it isn't necessary to read through the entire dataset to collect a sample. We assume shuffled=true by default, but if this assumption isn't correct (i.e., the data distribution varies wildly from file to file), simply pass shuffled=false into a read_* function.
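For example, if the data is partitioned by month so that each file has a very different distribution, you might write (a sketch; the path is hypothetical):

```julia
# Each file covers a different month, so no single file is representative.
# Disable the shuffled assumption so the sample is drawn across all files.
s = read_csv("s3://sales/2020/09"; shuffled=false)
```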
Invalidate Cache After External Modification
If you are reading from a location that has been modified by some other user or non-Banyan actor, you may want to specify that the cached source or sample is invalid. Simply pass metadata_invalid=true into a read_* function to indicate that the files (or their contents) in the source location have changed. Pass sample_invalid=true to indicate that the schema or distribution of the data has changed so much that a new sample is required.
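For instance, if another actor rewrote files in the bucket outside of Banyan, both caches may be stale (a sketch; the path is hypothetical):

```julia
# Files were modified outside of Banyan: discard the cached metadata
# and force a fresh sample to be collected.
s = read_parquet("s3://sales/2020/09"; metadata_invalid=true, sample_invalid=true)
```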
Preserve Cache When Writing
When using a write_* function, the default behavior is to immediately invalidate cached location sources and samples. However, sometimes the data has not changed enough (or at all) to warrant a cache invalidation. In that case, invalidate_metadata=false and/or invalidate_sample=false can be passed into the writing function.
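A sketch of preserving the cache on write (assuming write_parquet takes the data frame and destination path as its first arguments; the path is hypothetical):

```julia
# A minor update that doesn't meaningfully change the schema or
# distribution: keep the cached metadata and sample.
write_parquet(df, "s3://sales/2020/09";
              invalidate_metadata=false, invalidate_sample=false)
```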
Invalidating Samples and Metadata
You can also invalidate the sample or metadata for a given path with invalidate_sample or invalidate_metadata. To invalidate the samples and metadata for all locations read from or written to by the currently active cluster, use invalidate_all_locations.
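Sketched usage, assuming invalidate_sample and invalidate_metadata take the location's path and invalidate_all_locations takes no arguments, as the text suggests (the path is hypothetical):

```julia
# Drop the cached sample and metadata for one location...
invalidate_sample("s3://sales/2020/09")
invalidate_metadata("s3://sales/2020/09")

# ...or drop caches for every location the active cluster has touched.
invalidate_all_locations()
```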