🥳 Releases
Upgrade your CLI with pip install lightning-grid --upgrade
❤️ Find us in our Slack Community to say hi and/or to express your thoughts/questions.
⚡ June 28, 2022
CLI version: 0.8.67
In addition to several stability improvements, this release introduces two very exciting new Datastores features for our BYOC users! If you're not a BYOC user, but would like to learn more or try out these features, don't hesitate to reach out to us at support@grid.ai
🗄️ What's New with Datastores
Private S3 Mounting (BYOC Users Only)
Grid now supports creating Datastores from private AWS S3 buckets by using the `--no-copy` mode via the CLI. This is particularly valuable for incrementally adding data to the source bucket and for speeding up datastore creation when working with large buckets.
In order to allow Grid to access your private buckets, you'll need to create an authorized AWS Role using the `grid credential create --type s3` command (explained in detail in the link below). After creating a role, you can run the `grid datastore create S3://<private-bucket-name-here> --no-copy` command as usual, with no modifications needed.
Create a Datastore from a private S3 bucket
High-Performance Datastores (BYOC Users Only)
High Performance Datastores (HPDs) allow Bring Your Own Cloud customers who are looking to scale large datasets to optimize latency and significantly speed up data access. Currently, HPDs are backed by the FSx for Lustre service and offer more scalability and higher throughput than conventional Grid datastores backed by AWS S3.
HPDs are most useful for very large datasets (>1TB) or when a dataset is going to be used by a large number of concurrent experiments or sessions.
Create a High-Performance Datastore
note
If you are interested in learning more, or enabling either of these features, you can contact support@grid.ai
Session Memory Improvements
- Disabled virtual memory limiting for GPU machines in Sessions, preventing out of memory failures
- Grid Runs now default to 0 CPUs. We recently discovered an issue with runs where setting `--cpus` to 1 would also reduce the memory, causing lots of OOM issues. In previous versions of Grid, this was the default behavior. We've updated this behavior to set `--cpus` to 0 by default. By setting `--cpus` to 0, Grid will allocate all available CPU and memory to the experiment.
⚠️ June 24, 2022
CLI version: 0.8.65
This release includes an important update to how CPU and memory are allocated to experiments.
Prior to this release, Grid would set the default number of CPUs to 1 when a run was created without explicitly specifying `--cpus`.
We recently discovered an issue with runs where setting `--cpus` to 1 would also reduce the memory, causing lots of OOM issues.
So we've updated this behavior to set `--cpus` to 0 by default. This applies when creating runs with GPUs as well. By setting `--cpus` to 0, the backend will allocate all available CPU and memory to the experiment.
⚡ June 7, 2022
CLI version: 0.8.58
Grid Cloud Instance Types
We've made some changes to the platform that will impact start times for Sessions and Runs.
As a result of these changes, you'll experience longer start times for Sessions and Runs that use the `p3.2xlarge` instance type. If you're looking for a faster start time, we suggest using the `g4dn.xlarge` instance type instead.
In future Grid releases, the following instance types will be supported:
Name | CPU | GPU | Memory (GB) | Accelerator | numberOfAccelerators_acceleratorType_availableMemory |
---|---|---|---|---|---|
m5a.large (recommended for fast startup times) | 2 | 0 | 8 | CPU | 2_CPU_8GB |
m5a.2xlarge | 8 | 0 | 32 | CPU | 8_CPU_32GB |
g4dn.xlarge (recommended for fast startup times) | 4 | 1 | 16 | T4 | 1_T4_16GB |
p3.2xlarge | 8 | 1 | 61 | V100 | 1_V100_61GB |
p3.8xlarge | 32 | 4 | 244 | V100 | 4_V100_244GB |
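The identifiers in the last column appear to follow a count, accelerator, memory pattern. The Python sketch below rebuilds them from the other columns under that assumption. The `InstanceType` class and its `resource_name` method are hypothetical names used for illustration and are not part of the Grid CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """One row of the instance type table (hypothetical helper class)."""
    name: str
    cpus: int
    gpus: int
    memory_gb: int
    accelerator: str  # "CPU" for CPU-only machines, otherwise the GPU model

    def resource_name(self) -> str:
        # Assumption: the count is the number of GPUs when there are any,
        # and the number of CPUs for CPU-only machines.
        count = self.gpus if self.gpus > 0 else self.cpus
        return f"{count}_{self.accelerator}_{self.memory_gb}GB"


# Values copied from the table above.
INSTANCE_TYPES = [
    InstanceType("m5a.large", 2, 0, 8, "CPU"),
    InstanceType("m5a.2xlarge", 8, 0, 32, "CPU"),
    InstanceType("g4dn.xlarge", 4, 1, 16, "T4"),
    InstanceType("p3.2xlarge", 8, 1, 61, "V100"),
    InstanceType("p3.8xlarge", 32, 4, 244, "V100"),
]

for instance in INSTANCE_TYPES:
    print(f"{instance.name}: {instance.resource_name()}")
```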
Why have we made these changes?
We closely monitor usage of Grid and are always looking for improvements that will make the platform more straightforward, easier to use, and cost-effective. In changing how we manage certain instance types, we're able to offer faster start times on cheaper instances. Managing these instance types is a key area that will make Grid more sustainable and less expensive to use in the long term. We always want to ensure that Grid users are getting the compute resources they need at a price that is fair and transparent.
BYOC Instance Types
If you are currently using the BYOC feature, you will continue to have access to the full list of supported AWS instance types. If you are not currently using BYOC and want access to or information about additional instance types, reach out to us at support@grid.ai.
If you've got questions about these changes, reach out to us at support@grid.ai.
Fixes and Enhancements
- Adds UI support for skipping parameter evaluation when running hyperparameter sweeps
- Improvements to the process of integrating Grid with public and private GitHub organizations
- BYOC users: Fixes an issue with starting runs with unavailable instance types. If the default instance type is not available, the first instance in the specified list of instances will be used instead
- Stability improvements in the UI to make analyzing experiment results a better experience
- Better error messaging in the CLI
- Fixes a CLI issue where users could only retrieve the 50 most recent runs. To request details for a specific run in your run history, use `grid status RUN_NAME`
⚠️ Known Issues
- When creating a run in the UI, specify the path to the GitHub repo where the script is located. Providing the URL to the specific script is not currently supported.
- When creating a Datastore, data directories that contain soft symlinks will cause the Datastore upload to fail. To prevent this failure, replace soft symlinks with hard links.
🥳 May 17, 2022
CLI version: 0.8.47
Today's release includes several bug fixes to improve the overall experience with Grid.
Fixes and Enhancements:
- Experiments fail faster when errors are encountered during build or code execution
- Improves the Run-creation flow in the Web UI by fixing error messages reported due to insufficient repo access or invalid repos
- Stability improvements to the UI and event reloading
- Fixes the drop-down in the experiments table which allows you to add hyperparameter columns
- Adds support for nested requirements.txt files. For example:
```
-r ./base.txt
# install all extra dependencies for full package testing
-r ./extra.txt
# install all loggers for full package testing
-r ./loggers.txt
# extended list of dependencies for development and run lint and tests
-r ./test.txt
# install all extra dependencies for running examples
-r ./examples.txt
```
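To preview which packages a nested requirements file pulls in, the `-r` includes can be followed by hand. The Python sketch below is a simplified illustration, not pip's parser and not Grid's implementation. It only handles the short `-r <path>` form, comments, and blank lines, and it resolves each include relative to the file that references it. `flatten_requirements` is a hypothetical helper name.

```python
from pathlib import Path


def flatten_requirements(path, _seen=None):
    """Return requirement lines from `path`, following nested `-r` includes."""
    seen = set() if _seen is None else _seen
    path = Path(path).resolve()
    if path in seen:  # guard against circular includes
        return []
    seen.add(path)

    requirements = []
    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("-r "):
            # Includes are resolved relative to the including file.
            nested = path.parent / line[len("-r "):].strip()
            requirements.extend(flatten_requirements(nested, seen))
        else:
            requirements.append(line)
    return requirements
```

Calling `flatten_requirements("requirements.txt")` on a file like the one above would return the combined list of requirement lines from every included file, in include order.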
🥳 May 12, 2022
CLI version: 0.8.45
New and Improved Artifacts!
Today, we release an update to Artifacts which greatly improves stability and UX in the following ways:
- Ensures syncing of artifacts for fast-running experiments
- Ensures all artifacts that are produced by experiments are copied by Grid
- When the experiment stops running, the instance will not shut down until all artifacts have been copied
Note: With this change, a portion of instance CPU and RAM will be dedicated to artifact syncing processes. For users with memory-intensive code, if your code generates artifacts of size >= 1GB, you may experience a decrease in performance. In these scenarios, we recommend using an instance with more CPU/RAM.
Learn more about Artifacts and these new improvements here.
Additional Fixes and Enhancements
- Fixes issue with calculating pricing estimate during new run creation.
- Improves handling of Sessions when a process runs out of memory. In these events, the process will be terminated but the Session will remain running.
🔧 May 3, 2022
CLI version: 0.8.37
Datastore Enhancements
⭐ Faster S3 Datastores!
We are happy to announce that, as of today, creating datastores from S3 buckets is almost instant!
In most cases, your S3 bucket will fit one (or both) of the following criteria:
- the bucket is continually updating with new data which you want included in a Grid datastore
- the bucket is particularly large (leading to long datastore creation times)
In both of these cases, you can pass the `--no-copy` flag to the `grid datastore create` command. This flag will prevent Grid from making a copy of the dataset, which significantly speeds up datastore creation time when working with large buckets or when you intend to make incremental changes to your bucket and do not want to re-upload the entire dataset each time you add a new file.
Here's an example:
grid datastore create S3://ruff-public-sample-data/esRedditJson --no-copy
note
Please note that direct access to private S3 buckets is not currently supported.
Fixes and Enhancements
- [Enhancement] When specifying instance types with the `grid session change-instance-type` command, you can use either the instance name (ex: `grid session change-instance-type splendid-banzai-981 2_CPU_4GB`) or instance nickname (ex: `grid session change-instance-type splendid-banzai-981 t2.medium`) interchangeably
- [Enhancement] Grid's syntax for scheduling multiple experiments with combinations of arguments (i.e. Grid Search or Random Search) can sometimes conflict with the arguments your script expects. In that case, you can use the none strategy for parameter evaluation. More details can be found here
- [Fix] Resolves an issue with creating Runs from the UI using the random search strategy when the number of trials > experiments.
- [Deprecated] Changing Session instance type from the UI is currently not supported.
🥳 April 13, 2022
CLI version: 0.8.26
Notable Fixes and Enhancements
- Adds a new option for skipping parameter evaluation when not using the grid search or random search HPO features. More details here
- Resolves issues with artifacts not saving correctly to experiment sub-directories
🥳 March 30, 2022
CLI version: 0.8.17
This release includes bug fixes and stability improvements.
We've deprecated the following CLI options:
grid run --description
grid stop session
🥳 March 15, 2022
CLI version: 0.8.7
🤯 GRID_SESSION_ID and GRID_SESSION_NAME environment variables
We've added two environment variables that allow you to programmatically reference a Session from within the Session itself.
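As a minimal sketch, the two variables can be read from Python like any other environment variable. The `current_session` helper is a hypothetical name, and the fallback branch assumes the variables are simply unset when the code runs outside a Session.

```python
import os


def current_session():
    """Return (session_id, session_name); both are None outside a Session."""
    return os.environ.get("GRID_SESSION_ID"), os.environ.get("GRID_SESSION_NAME")


session_id, session_name = current_session()
if session_id is None:
    # Assumption: the variables are not set outside of a Grid Session.
    print("Not running inside a Grid Session")
else:
    print(f"Running in Session {session_name} (id: {session_id})")
```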
🔧 March 10, 2022
CLI version: 0.8.4
✔️ Resolves an issue where using a relative path for the `dependency_file_info` property in a Run config caused failures. For example, this now works if you are operating from a subdirectory of a git repo:
```
# Dependency file specification
dependency_file_info:
  package_manager: conda
  path: ./env/env-deepcdl-pytorch.yml
```
✔️ Support for specifying the version of the Julia image to use in Runs. We will support every patch release of Julia from 1.6.1 up.
- `grid run --framework julia` will use the latest Julia version available (currently 1.7.1)
- `grid run --framework julia:X.Y.Z` will use Julia version X.Y.Z
✔️ Runs will fail more quickly if there is an issue with image building.
✔️ Resolves an issue with the `--num_trials` parameter being ignored.
✔️ Logging improvements to silence noisy stacktraces.
✔️ 'pytorch' and 'torch' are now equivalent and acceptable inputs to the framework option for `grid run` (ex: `--framework pytorch` == `--framework torch`)
🔧 March 1, 2022
CLI version: 0.8.1
Spring cleaning came early. This release features a lot of backend magic that improves overall stability and UX with Grid. We’re also excited to announce a dazzling set of enhancements to Datastores! You’ll notice uploading to Datastores is now at least 5x faster! More details and information on how to use the feature are below.
Datastore Enhancements
- Datastore upload speeds increased by 5x
- Improved stability during Datastore uploads (reduced chance of failure during upload)
- Disk space usage will no longer increase during Datastore upload
- If a Datastore gets interrupted during upload, the next time you create a Datastore, you will be prompted to resume the upload
- The `--source` parameter has been deprecated. It will no longer be supported in future releases. You can just use `grid datastore create [filename]` and the datastore will inherit the filename as its name
- Additional magical backend improvements that you can't see, but certainly will feel
Notable Fixes and Enhancements
- `grid run` help menu includes additional information about the `--localdir` option
- The following actions have been added to the YAML config (see the docs on Actions for more information):
  - on_build_start
  - on_build_end
  - on_experiment_start
  - on_experiment_end
- Newly created datastores with total size <1 MiB will report as 1 MiB total size
- Improvements to costs reporting for runs and experiments
⚠️ February 3, 2022
Artifacts don't sync for fast experiments
We've detected a race condition with short-running experiments which may cause artifacts not to be properly synced. We're working on a long-term fix, which we expect to ship in the next release in the coming days. As a workaround, we recommend ensuring your experiments last at least a minute (to be safe), sleeping if needed.
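One way to apply the workaround is to pad short experiments so they always run for at least a minute. The Python sketch below is only an illustration. `run_with_minimum_duration` and `remaining_padding` are hypothetical helpers, not Grid features, and the 60-second floor comes from the recommendation above.

```python
import time

MIN_RUNTIME_SECONDS = 60.0  # "at least a minute", per the workaround


def remaining_padding(elapsed, minimum=MIN_RUNTIME_SECONDS):
    """Seconds still needed to reach the minimum runtime (never negative)."""
    return max(0.0, minimum - elapsed)


def run_with_minimum_duration(experiment, minimum=MIN_RUNTIME_SECONDS):
    """Call `experiment()`, then sleep until `minimum` seconds have passed."""
    start = time.monotonic()
    result = experiment()
    time.sleep(remaining_padding(time.monotonic() - start, minimum))
    return result
```

In a training script, this would wrap the entry point, for example `run_with_minimum_duration(main)` where `main` is your own function. Experiments that already run longer than a minute are not delayed.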
🔧 January 12, 2022
CLI version: 0.7.3
A maintenance release has been issued with the following fixes:
- resolves an issue that was causing experiments to remain queued for 1 hour+
- fixes issue where Datastores and Runs couldn’t be viewed from the UI
- addresses an issue with Multinode Runs that were not running
Cluster Contexts
For Bring Your Own Cloud (BYOC) users, we've introduced the concept of cluster contexts. You can set the cluster context so that all your CLI actions (including creation of a resource such as a Run or Session) are made against that cluster.
By default, the cluster context is set to the global cluster. You can change the context at any time by using the `grid user set-cluster-context` command or by specifying the cluster name in `~/.grid/settings.json`.
Find out which cluster context is currently set by using the `grid user` command.
More information in the documentation on how to 'Run Workloads in Your New Cluster'.
🥳 January 5, 2022
CLI version: 0.7.1
Hi! Welcome to 2022 :) Today we bring you a new Grid release with exciting new features, continued performance and stability improvements, and the beginnings of a very productive new year. As always, use `pip install lightning-grid --upgrade` to update the CLI to the new version and hit us up in our Slack Community with any thoughts or questions.
Auto-resume Experiments
Surprise! You can now enable the auto-resume of experiments that are running on spot instances. Should your experiment be interrupted, Grid can automatically resume your experiment from the last saved checkpoint when a new instance becomes available.
And more good things:
- Grid will recover all artifacts, including the last saved checkpoints.
- The local filesystem will be preserved between experiment interruption and experiment resumption.
🪄 Enable Auto-resume in the UI
Select the “Auto-resume” option after enabling the Use Spot Instance option in a new Run.
🪄 Enable Auto-resume in the CLI
Use the `--auto_resume` flag to indicate this experiment is safe to resume.
Example: `grid run --use_spot --auto_resume --instance_type p3.2xlarge mnist.py`
Datastore Enhancements
⭐ Full S3 Datastore Support
You can now connect Grid to any publicly available S3 dataset, making it way faster to get your S3 data into Grid.
Specify a public S3 bucket, file, or path when creating a new Datastore.
🪄 Supported URL formats:
s3://ryft-public-sample-data/
https://public_url.zip
⭐ Datastore Mount Path
And the award for top FAQ goes to...
How do I access my data in a datastore?
With this release, accessing your data in a Session or Run is way more straightforward.
After you’ve created a datastore, you can access it at `/datastores` in a Session or Run.
More details on how to mount datastores:
Attaching Datastores to a Session
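From inside a Session or Run, a quick way to confirm what is mounted is to list the contents of `/datastores`. The Python sketch below assumes each datastore appears as its own directory under that path, which is an assumption about the layout and not something stated in these notes. `list_datastores` is a hypothetical helper.

```python
from pathlib import Path


def list_datastores(root="/datastores"):
    """Return the names of directories found under the datastore mount path."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []  # nothing mounted, or not running inside a Session or Run
    return sorted(entry.name for entry in root_path.iterdir() if entry.is_dir())


print(list_datastores())
```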
Fixes and Enhancements
- Performance improvements to Sessions, so your data is faster to access once a resumed Session becomes active.
- Increased observability into Session statuses and reasons for a potential Session failure.
- Hover over the status of a Datastore, Session, or Experiment for more details on the status.