Demystifying Datastores
Datastores are high-performance, low-latency, versioned, and scalable datasets which can be instantly mounted into any Session or Run.
What file types are supported?
Datastores can be created from any file type. Grid treats each file as a collection of bytes that exists under a particular name within a directory structure (e.g. `./dir/some-image.jpg`).
info
In order to ensure complete data privacy & flexibility of use, Grid never attempts to process the contents of the files or infer/optimize for any particular usage behaviors based on file contents.
Why use Datastores?
Datastores are backed by cloud storage. They are made available to compute jobs as part of a read-only filesystem. If you have a compute script which reads files in a directory structure on your local computer, then the only thing you need to change when running on Grid is the location of the data directory!
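For example, a training or preprocessing script only needs its data directory to be configurable; the same code then runs locally or on Grid. Here is a minimal sketch, assuming a hypothetical datastore named my-dataset and an illustrative --data-dir argument (neither is part of any Grid API):

```python
import argparse
from pathlib import Path

# Illustrative script: the only Grid-specific change is the value passed to
# --data-dir, e.g. ./data locally vs. /datastores/my-dataset on Grid.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data-dir",
    default="./data",
    help="local path, or /datastores/<datastore-name> when running on Grid",
)
args = parser.parse_args()

data_dir = Path(args.data_dir)
for path in sorted(data_dir.rglob("*.jpg")):
    with open(path, "rb") as f:
        image_bytes = f.read()  # read each file exactly as you would locally
    print(f"{path.relative_to(data_dir)}: {len(image_bytes)} bytes")
```

Locally you would run this with the default path; on Grid you would pass the datastore mount location instead (e.g. `--data-dir /datastores/my-dataset`).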
Datastores are a necessity when dealing with data at scale (e.g. data which cannot reasonably be downloaded from an HTTP URL every time a compute job starts up): they provide a single, immutable dataset resource of near unlimited scale.
tip
A single datastore can be mounted into tens or hundreds of concurrently running compute jobs in seconds, ensuring that no expensive compute time is wasted waiting for data to download, extract, or otherwise "process" before you can move on to the real work.
We know how important a role your data plays in everything you run on Grid. We've put an incredible amount of time and effort into creating a truly unique optimization pipeline which removes every bit of latency possible from the point your program executes `with open(filename, 'r') as f:` to the instant that data is provided to you. You'll find traversing the data directory structure in a Session indistinguishable from the experience of `cd`-ing around your local workstation.
Grid Datastores are an easy way to deal with data in the cloud. Our capabilities and performance are constantly evolving. We believe you'll love the simplicity and experience of using them!
Capabilities
Datastores today have 3 main capabilities:
- Can be created from S3 buckets in both Grid Cloud and Bring Your Own Cloud environments
- Attachable to Runs and Sessions
- Creatable from your local machine, Sessions, or Clusters!
High-Performance Datastores (BYOC users only)
High-Performance Datastores (HPDs) let Bring Your Own Cloud customers who work with large datasets reduce latency and significantly speed up data access. Currently, HPDs are backed by the FSx for Lustre service and offer more scalability and higher throughput than conventional Grid datastores backed by AWS S3.
HPDs are most useful for very large datasets (>1TB) or when a dataset is going to be used by a large number of concurrent experiments or sessions.
How do I access the data in a datastore?
By default, datastores are mounted at `/datastores/<datastore-name>/` on both Runs and Sessions.
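As a quick illustration, the mount behaves like any ordinary read-only directory; the datastore name my-dataset below is hypothetical, so substitute your own:

```python
from pathlib import Path

# Hypothetical datastore name; the default mount point follows the
# /datastores/<datastore-name>/ convention described above.
mount_root = Path("/datastores/my-dataset")

# Walk the mounted dataset just like a local, read-only directory tree.
for path in sorted(p for p in mount_root.rglob("*") if p.is_file()):
    print(path.relative_to(mount_root), path.stat().st_size, "bytes")
```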
If for some reason you need the datastore mounted at a different location, you can manually specify the mount path when using the CLI. Please refer to the CLI commands reference for assistance specifying the desired configuration.
Next Steps
For more information on using Datastores, start with the first section of the Using Datastores tutorial.
More advanced users should feel free to skip ahead to any of the other tutorials of interest; these are linked in the sidebar.