Cloud machines are normally expensive. However, if your job can tolerate being interrupted at any time (e.g., fine-tuning, or training a model that can be restarted), you can use spot instances in Grid to lower training and development costs.
Enable Spot Instances via the UI
Enable Spot Instances via the CLI
grid run --use_spot pl_mnist.py
Prepare code for interruption
To take advantage of interruptible machines, make sure that:
- Your code saves checkpoints or any other state it needs. Grid automatically picks these up into your artifacts.
- Your code can be restarted from a checkpoint or state file.
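The two requirements above can be illustrated with a minimal, framework-agnostic sketch. It is not Grid's or any library's API: the function names (save_state, load_state, train), the JSON state file, and the stop_after parameter that simulates an interruption are all hypothetical, chosen only to show the save-then-resume pattern. A real training script would store model and optimizer state instead of a counter.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if path and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "total": 0}


def train(ck_path, max_epochs, stop_after=None):
    """Toy training loop; stop_after (hypothetical) simulates a spot interruption."""
    state = load_state(ck_path)
    # Resume from the first epoch that has not been completed yet.
    for epoch in range(state["epoch"], max_epochs):
        if stop_after is not None and epoch == stop_after:
            return state  # simulated interruption before this epoch runs
        state["total"] += epoch      # stand-in for a real training step
        state["epoch"] = epoch + 1   # next epoch to run after a restart
        save_state(ck_path, state)   # checkpoint after every epoch
    return state
```

Calling train("state.json", 5, stop_after=3) and then train("state.json", 5) gives the same final state as one uninterrupted run, because the second call picks up at the first unfinished epoch. Writing to a temporary file and renaming it means an interruption during a save leaves the previous checkpoint intact.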
Restarting interrupted jobs
Once the machine is interrupted, your job on Grid will stop. To continue running your code, do the following:
- Navigate to your experiment artifacts.
- Copy the link to the state files (or checkpoint) that you need.
- Resubmit the job with the path to that file.
For example, assume your script has an argument called --ck_path that takes the location of the checkpoint:
grid run --use_spot main.py --ck_path https://grid.ai/url/to/checkpoint.ckpt
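On the script side, handling such an argument might look like the sketch below. It assumes the checkpoint link can be downloaded directly without extra authentication, which may not hold for your artifacts. The helper names (parse_args, is_url, resolve_checkpoint) are hypothetical, and only the --ck_path argument name comes from the command above.

```python
import argparse
import os
import urllib.parse
import urllib.request


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Mirrors the --ck_path argument in the command above; None means start fresh.
    parser.add_argument("--ck_path", default=None, help="checkpoint path or URL")
    return parser.parse_args(argv)


def is_url(value):
    """True if value looks like an http(s) link rather than a local path."""
    return urllib.parse.urlparse(value).scheme in ("http", "https")


def resolve_checkpoint(ck_path, download_dir="."):
    """Return a local file path for ck_path, downloading it first if it is a URL."""
    if ck_path is None:
        return None
    if not is_url(ck_path):
        return ck_path  # already a local path
    url_path = urllib.parse.urlparse(ck_path).path
    filename = os.path.basename(url_path) or "checkpoint.ckpt"
    local_path = os.path.join(download_dir, filename)
    # Assumption: the link is publicly downloadable (no auth headers needed).
    urllib.request.urlretrieve(ck_path, local_path)
    return local_path
```

In main.py, you would call resolve_checkpoint(parse_args().ck_path) once at startup and, if it returns a path, load the checkpoint from it before training continues.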
If you have additional questions about Runs, visit the FAQ. The section is periodically updated with common questions from the Grid community.