Typical workflow (CLI user)
Using the command line interface

Goal

The goal of this tutorial is to walk through a typical workflow using the Grid command-line (CLI) package.
For users who prefer the web app, there is a mirror tutorial for the web app.
The Grid CLI app has a 1:1 match in functionality with the Web app.
We'll use image classification as the example to illustrate the key ideas. The typical workflow goes like this:
Note a few things:
    The dataset is small so the tutorial can be quick. But the workflow doesn't change for large-scale data.
    We'll use PyTorch Lightning for simplicity, but the framework can be any of your choice.

Tutorial time: 19 minutes

Time       | Step
1 minute   | Installing the Grid CLI
2 minutes  | Preparing the dataset
2 minutes  | Creating a Datastore
3 minutes  | Starting a Session
1 minute   | ssh connect to the Session
5 minutes  | Prototype the model
1 minute   | Pause the Session
1 minute   | Run (hyperparameter sweep)
3 minutes  | Bonus: Become a power user

Terminology Glossary

Term           | Description
CIFAR-5        | A dataset with 5 classes (airplane, automobile, ship, truck, bird).
grid datastore | A high-performance, low-latency, auto-versioned dataset.
grid run       | Runs a model (or many models) on a cloud machine (hyperparameter sweep).
grid session   | A live machine with 1 or more GPUs for developing models.
experiment     | A single model with a given configuration.
run            | A collection of experiments.
ssh            | A way to connect from a local machine to a remote machine.

The dataset

For this tutorial we'll be using CIFAR-5. This is a subset of CIFAR-10 that we've chosen to make the models train faster.
CIFAR-5 (modified from CIFAR-10)
The goal is to teach a small neural network to classify these 5 classes.

Step 0: Install the Grid CLI

(Time: 1 minute)
It's recommended to use a virtual environment when working with Grid. You can use conda or venv (if you're using macOS, we recommend venv).
pip install lightning-grid --upgrade
Now log in:

grid login
This will open the browser to your settings. If you signed up to Grid with Google, your username is your email address. If you used GitHub, your username is your GitHub username.
If your machine doesn't support browsers, use this instead (get your username and key here):

grid login --username YOUR_USERNAME --key YOUR_API_KEY
You'll only have to do this once!

Step 1: Prepare the dataset

(Time: 2 minutes)
In a real workflow, you would already have the data locally or on a cluster. To make sure we are all using the same data, download the dataset to your machine and unzip it.
# download
curl https://pl-flash-data.s3.amazonaws.com/cifar5.zip -o cifar5.zip

# unzip
unzip cifar5.zip
This should create a folder with this structure:
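Assuming the standard image-folder layout (one subfolder per class, per split), the tree looks roughly like this (check your local copy; the zip's exact contents aren't listed in this tutorial):

cifar5
├── train
│   ├── airplane
│   ├── automobile
│   ├── bird
│   ├── ship
│   └── truck
└── test
    ├── airplane
    ├── automobile
    ├── bird
    ├── ship
    └── truck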
Now that we all have the same data, let's start the real tutorial!
Hint: The UI can also create a datastore directly from a .zip; the manual download here is just for tutorial purposes.

Step 2: Create a datastore

(Time: 2 minutes)
In a realistic workflow, we would start here. The first thing you want to do is create a DATASTORE on Grid from your dataset. The datastore optimizes your data for low latency and high availability on any machine you run on Grid.
Now create the datastore, which will upload and optimize your dataset:
grid datastore create --source cifar5/ --name cifar5
Make sure it was created:

grid datastore list
Note: The datastore moves through a series of statuses while it is being optimized. Once it reaches "Succeeded", it's ready to be used.
Periodic uploads
In some cases your data might change every few hours. If so, you can add the datastore create command to your crontab; Grid will automatically version the datastore for you.
# write out current crontab
crontab -l > mycron

# run datastore upload every hour, every day
echo "0 * * * * grid datastore create --source cifar5/ --name cifar5" >> mycron

# install new cron file
crontab mycron
rm mycron

Step 3: Create ssh keys (optional)

(Time: 1 minute)
This is optional, but it enables you to:
    ssh in from your local machine
    use ssh + VSCode
Create the ssh keys and add them to Grid.
You only need to do this step once.
# make the ssh key (if you don't have one)
ssh-keygen -b 2048 -t rsa -f ~/.ssh/grid_ssh_creds -q -N ""

# add the public key to Grid
grid ssh-keys add key_1 ~/.ssh/grid_ssh_creds.pub
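To double-check that the key was registered (assuming your CLI version includes a list subcommand):

grid ssh-keys list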

Step 4: Start a Session

(Time: 3 minutes)
Now that your data has been uploaded, the next step in a real workflow is to spend time doing any of the following:
    Debugging the model
    Prototyping it on multiple GPUs
    Adjusting the batch size to maximize GPU usage
    Using the model for analysis, which might require GPUs
    Exploring and visualizing the model
This is exactly what Sessions were created for.
Start a Session named resnet-debugging with 2 M60 GPUs on it and attach our CIFAR-5 dataset.
Note: A credit card needs to be added to use GPU machines
grid session create \
  --instance_type g3.8xlarge \
  --name resnet-debugging \
  --datastore_name cifar5 \
  --datastore_version 1
See if it's ready:

grid status

Step 5: Connect to the Session

(Time: 1 minute)
Once the Session is ready, there are a few ways to interact with it. Let's log in to the Session via SSH:
grid session ssh resnet-debugging

Now you're on the cloud machine! See how many GPUs you have:

nvidia-smi

List the datastore:

ls ~/datastore
Now you can code away!
git clone https://github.com/williamFalcon/cifar5

# debug, prototype, etc...

# push changes when done
git commit -am "..."
git push

Step 6: Develop the model

(Time: 5 minutes)
Now that you have your data, code, and 2 GPUs, we get to the fun part! Let's develop the model.
At the end of the last section you used ssh to make model changes. However, I actually prefer to use VSCode for this. Let's set up VSCode to code directly on the remote machine.
First, launch VSCode.
Install the Remote Development extension
ssh into the Session:

grid session ssh resnet-debugging

Now link up VSCode with the Session:

grid session ssh vscode
The model
For this tutorial, I'm going to use a non-trivial project structure that is representative of realistic use cases [code link].
The project has this structure
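At minimum it includes the files used later in this tutorial (the full repo may contain more):

cifar5
├── project
│   ├── __init__.py
│   └── lit_image_classifier.py
├── requirements.txt
└── setup.py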
This folder is complicated on purpose, to showcase that Grid is designed for realistic deep learning workloads. I'm purposely avoiding simple single-file projects (code reference) that look like the sketch below, since those are trivial for Grid to handle.
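A minimal sketch of such a single-file project (illustrative only; the file name and model here are assumptions, not the repo's actual code):

# minimal_classifier.py - an illustrative single-file project
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, num_classes=5, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        # a tiny CNN stand-in; a real project would swap in a resnet backbone
        self.model = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.model(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)


if __name__ == "__main__":
    # expects the cifar5/ folder from Step 1 (one subfolder per class)
    train_set = ImageFolder("cifar5/train", transform=transforms.ToTensor())
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(LitClassifier(), DataLoader(train_set, batch_size=32, shuffle=True))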
For best practices on structuring machine learning projects in general, stay tuned for a best-practices guide.
Clone the project on the interactive Session
git clone https://github.com/williamFalcon/cifar5
Install requirements + project
cd cifar5

sudo pip install -r requirements.txt
pip install -e .
Now run the following command to train a ResNet-50 on 2 GPUs:

python project/lit_image_classifier.py \
  --data_dir ~/datastore \
  --gpus 2 \
  --accelerator 'ddp' \
  --backbone resnet50
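(How do --gpus and --accelerator reach the Trainer? Scripts like this typically forward Trainer flags through argparse. A minimal sketch of that pattern, assuming Lightning's add_argparse_args API; the repo may wire it differently:)

from argparse import ArgumentParser
import pytorch_lightning as pl

parser = ArgumentParser()
parser.add_argument("--data_dir", type=str, default="./data")
parser.add_argument("--backbone", type=str, default="resnet50")
parser.add_argument("--learning_rate", type=float, default=1e-3)
parser = pl.Trainer.add_argparse_args(parser)  # adds --gpus, --accelerator, ...
args = parser.parse_args()

trainer = pl.Trainer.from_argparse_args(args)  # builds the Trainer from the flags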
You should see the results (the script is designed to overfit the val split)
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 1.0, 'test_loss': 1.2107692956924438}
--------------------------------------------------------------------------------
At this step (in a real workflow) you would code the model, debug, etc... using the remote GPUs from your local VSCode :)
Once you're ready, commit your changes so we can train at scale:

git commit -am "changes"
git push

Step 7: Pause the Session

(Time: 1 minute)
Great! Now that our model is ready to run at scale, we can pause the Session:

grid session pause resnet-debugging
If you're tired of rebuilding environments every time you want to do a little bit of work, pausing is your saving grace. Pausing saves:
    Your files
    Your data
    Your environment (installed packages, etc.)
In addition, a paused Session stops incurring charges!
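When you're ready to pick up where you left off, resume it (assuming the resume subcommand, the counterpart to pause, in your CLI version):

grid session resume resnet-debugging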

Step 8: RUN (hyperparam sweep)

(Time: 1 minute)
Once your model is ready to go, you usually want to train it to convergence. If you already know a good set of hyperparameters, your run will be very simple since it will train a single model.
If you'd like to find better hyperparameters for your model, a RUN can launch multiple variations of your model to try all hyperparameters at once.
First, always commit your changes and push to GitHub; Grid runs the latest version of your code (based on whatever your local branch is).
git commit -am "ready to run"
git push
Now let's kick off a RUN.
First, make sure we're all in the same folder for the tutorial:

cd cifar5/project
ls

# __init__.py lit_image_classifier.py
Now kick off the run with grid run:

grid run \
  --datastore_name cifar5 \
  --datastore_version 1 \
  --datastore_mount_dir /cifar5 \
  --instance_type 2_m60_8gb \
  --framework lightning \
  --gpus 2 \
  lit_image_classifier.py \
  --backbone "['resnet50', 'resnet34', 'resnet18']" \
  --learning_rate "uniform(1e-5, 1e-1, 5)" \
  --data_dir /cifar5 \
  --gpus 2
This sweep launches one experiment per hyperparameter combination: 3 backbones × 5 sampled learning rates = 15 experiments. You can kick off the run from the Session or from your local machine (but you'll need to clone the project locally first).

Bonus: Use a YAML for common runs

When your runs get repetitive or have a lot of hyperparameters, use a YAML file to save the run configuration.
Check out the YML documentation
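As a rough illustration (the field names below are assumptions; the YML documentation is the authoritative schema), a run config mirrors the flags you'd otherwise pass on the command line:

# config.yml - illustrative only; see the YML docs for the real schema
compute:
  train:
    instance: 2_m60_8gb
    gpus: 2
    datastore_name: cifar5
    datastore_version: 1
    datastore_mount_dir: /cifar5

You'd then point grid run at the file (again, check the docs for the exact flag):

grid run --config config.yml lit_image_classifier.py --data_dir /cifar5 --gpus 2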