Tips & Tricks
Interruptible Runs
Interruptible Runs are powered by spot instances, which are 50-90% cheaper, but the machine can be interrupted at any time. If you are using PyTorch Lightning and a job gets interrupted, you can reload its checkpoints.
Grid lets you resume a Run from where it left off, as follows.
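from argparse import ArgumentParser

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--checkpoint_path', type=str)
    args = parser.parse_args()
    if args.checkpoint_path:
        # load_from_checkpoint returns the restored model (LitModel is your LightningModule)
        model = LitModel.load_from_checkpoint(checkpoint_path=args.checkpoint_path)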
grid run --g_use_spot train.py --checkpoint_path "Artifact URL"
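To make the resume pattern concrete, here is a minimal, framework-agnostic sketch of what an interruptible training script needs to do: checkpoint regularly and pick up from the last checkpoint on restart. It is not Grid's or Lightning's implementation. The JSON state file and the `load_state`, `save_state`, and `train` helpers are hypothetical names chosen for illustration, and the `interrupt_after` argument only simulates an interruption.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved training state, or a fresh one if no checkpoint exists."""
    if path and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_state(path, state):
    """Write to a temp file, then rename, so an interruption cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, max_epochs, interrupt_after=None):
    """Run (or resume) a toy training loop, checkpointing after every epoch."""
    state = load_state(path)
    for epoch in range(state["epoch"], max_epochs):
        if interrupt_after is not None and epoch >= interrupt_after:
            return state  # simulate the spot instance being reclaimed
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        save_state(path, state)
    return state
```

Calling `train` a second time with the same path continues from the last completed epoch instead of starting over. In a real Lightning script, the checkpoint files written during training play the role of the JSON file used here.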
Stop a Run after X hours
As a convenience, we provide a GitHub Action for stopping a Run that has been running for X hours.
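The core check behind any "stop after X hours" automation is a comparison of elapsed time against a limit. The sketch below shows only that check. It is not the GitHub Action's code, and the `exceeds_time_limit` function and its parameters are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta, timezone


def exceeds_time_limit(started_at, max_hours, now=None):
    """Return True once a Run that began at started_at has been running for max_hours or more."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - started_at >= timedelta(hours=max_hours)
```

Passing `now` explicitly makes the check easy to test. In a scheduled job you would leave it out and use the current time.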
Auto Structuring Deep Learning Training
The recent 1.3 Release of PyTorch Lightning provides a new Lightning CLI [beta] for Auto Structuring Deep Learning Training.
from pytorch_lightning.utilities.cli import LightningCLI
LightningCLI(MyModel, MyData, trainer_defaults={'max_epochs': 10})
When combined with Grid, the Lightning CLI enhances your training scripts, enabling you to quickly take advantage of any hardware configuration and perform Grid Data, Model, and Trainer sweeps without having to integrate external libraries or add extra code. For more information check out:
- Auto Structuring Deep Learning Projects with the Lightning CLI
- Configuring Grid Hyper Parameter Sweeps
- PyTorch Lightning CLI Docs
Early Stopping
The recent 1.3 Release of PyTorch Lightning provides 3 New Thresholds for Early Stopping (Stopping, Divergence, and Check Finite) that can save you significant money on your Grid Runs.
The EarlyStopping Callback in Lightning allows the Trainer to automatically stop when a given metric stops improving. You can define your own custom metrics or take advantage of our TorchMetrics package to select common metrics to log and monitor. Early Stopping is perfect for Grid Runs because it limits the time spent on experiments that lead to poor convergence or overfitting.
Adding EarlyStopping thresholds to your PL Runs is as simple as adding the following few lines to your code.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="val_loss",
    stopping_threshold=1e-4,   # Stops training immediately once the monitored quantity reaches this threshold
    divergence_threshold=9.0,  # Stops training as soon as the monitored quantity becomes worse than this threshold
    check_finite=True,         # Stops training if the monitored metric becomes NaN or infinite
)
trainer = Trainer(callbacks=[early_stopping])
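To see how the three thresholds interact, here is a small pure-Python sketch of the decision they describe. It is a simplified model based on the comments above, not Lightning's actual implementation: it ignores `patience` and `min_delta`, and the `early_stop_reason` function name and its return strings are hypothetical.

```python
import math


def early_stop_reason(value, stopping_threshold=None, divergence_threshold=None,
                      check_finite=True, mode="min"):
    """Return why training should stop for this metric value, or None to keep going."""
    # check_finite: a NaN or infinite metric means the run is broken
    if check_finite and not math.isfinite(value):
        return "non-finite"
    # In "min" mode lower is better (e.g. a loss); in "max" mode higher is better
    if mode == "min":
        def better(a, b):
            return a < b
    else:
        def better(a, b):
            return a > b
    # stopping_threshold: the metric is already good enough
    if stopping_threshold is not None and better(value, stopping_threshold):
        return "reached stopping threshold"
    # divergence_threshold: the metric is worse than we are willing to tolerate
    if divergence_threshold is not None and better(divergence_threshold, value):
        return "diverged"
    return None
```

With the settings from the snippet above, a `val_loss` of `12.0` would count as diverged, `5e-5` would satisfy the stopping threshold, and `0.5` would let training continue.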
You can then pocket the savings or reinvest them in more promising configurations to take your model's performance and convergence to the next level.
Sessions and data files
Sometimes you need a quick way of copying files from your local machine to your Grid Session. I recently had to do this in order to configure the Kaggle CLI for a competition I was working on. Grid Sessions provide two clean options for uploading local files.
Secure Copy (SCP)
Once you've configured SSH with the Grid CLI, you can quickly copy files to your Session with the scp command as follows:
scp local_file grid_session_name:~/path_to_copy_to/
Using JupyterLab
If the CLI is not your thing, you can also upload files using JupyterLab.
This video shows you how to do that. For more information check out:
- Grid Sessions Docs
- SSH into a Grid Session
- SCP Command
- JupyterLab with Sessions
- JupyterLab Uploading and Downloading Files
Keeping track of costs
Have you ever wanted to estimate how much a cloud training run will cost you? With PyTorch Lightning and Grid, you can.
The recent 1.3 Release of PyTorch Lightning provides a new Trainer flag called max_time that lets you stop your Grid Run and save a checkpoint once you've reached the maximum allotted time.
Combined with Grid's ability to estimate how much a Run will cost per hour, you can use this flag to better budget your experiments.
from pytorch_lightning import Trainer

# Default (disabled)
trainer = Trainer(max_time=None)
# Stop after 12 hours of training or when reaching 10 epochs (string)
trainer = Trainer(max_time="00:12:00:00", max_epochs=10)
# Stop after 1 day and 5 hours (dict)
trainer = Trainer(max_time={"days": 1, "hours": 5})
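Since max_time caps the duration of a Run, multiplying it by an hourly rate gives an upper bound on what the Run can cost. The sketch below works through that arithmetic for the string (`DD:HH:MM:SS`) and dict forms shown above. The `parse_max_time` and `max_cost` helpers are hypothetical, and the hourly rate is an assumed input that you would take from Grid's cost estimate for your chosen hardware.

```python
from datetime import timedelta


def parse_max_time(max_time):
    """Convert a max_time value (None, "DD:HH:MM:SS" string, or dict) to a timedelta."""
    if max_time is None:
        return None
    if isinstance(max_time, str):
        days, hours, minutes, seconds = (int(part) for part in max_time.split(":"))
        return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    return timedelta(**max_time)


def max_cost(max_time, hourly_rate):
    """Upper bound on the cost of a Run, or None when max_time is disabled."""
    duration = parse_max_time(max_time)
    if duration is None:
        return None
    return duration.total_seconds() / 3600 * hourly_rate
```

For example, at an assumed rate of 2.0 per hour, `max_cost({"days": 1, "hours": 5}, 2.0)` covers 29 hours and comes to 58.0. The real cost can be lower if the Run finishes or stops early.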
With this trick you can better manage your training budget and invest it in more promising configurations to take your model's performance and convergence to the next level.
Periodic Uploads of Datasets to Datastores
Did you know that you can schedule periodic uploads of Datasets to Datastores?
Machine learning data is dynamic. If an ML model was trained on data from 2018, it might model a term such as "corona" differently than a model that was continuously kept up to date. To help make pipelining easier, Grid supports periodic uploading of data from its source to a Datastore.
Here is an example of how you can quickly configure this functionality.
# write out current crontab
crontab -l > mycron
# run datastore upload every hour every day
echo "0 * * * * grid datastore create --source data/path --name dataset" >> mycron
# install new cron file
crontab mycron
rm mycron
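One thing to watch with the steps above is that running them twice appends the same entry twice. The following sketch shows one way to build the hourly entry and add it only if it is missing. It operates on crontab text as a plain string and does not call cron itself; `hourly_upload_entry` and `add_entry` are hypothetical helper names, and the `grid datastore create` arguments are copied from the example above.

```python
def hourly_upload_entry(source, name):
    """Build a crontab line that runs the datastore upload at minute 0 of every hour."""
    return f"0 * * * * grid datastore create --source {source} --name {name}"


def add_entry(crontab_text, entry):
    """Return crontab_text with entry appended, unless that exact line is already present."""
    lines = crontab_text.splitlines()
    if entry not in lines:
        lines.append(entry)
    return "\n".join(lines) + "\n"
```

You could feed this the output of `crontab -l` and write the result back with `crontab`, in the same way the shell steps use the temporary `mycron` file.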