Typical workflow (CLI user)

Using the command line interface

Goal

The goal of this tutorial is to walk through a typical workflow using the Grid command-line interface (CLI).

If you prefer the web app, there is a mirror tutorial for the web app.

The Grid CLI matches the web app 1:1 in functionality.

We'll use image classification as the example to illustrate the key ideas. The typical workflow is outlined in the table below.

Note a few things:

  • The dataset is small so the tutorial can be quick. But the workflow doesn't change for large-scale data.

  • We'll use PyTorch Lightning for simplicity, but you can use any framework you like.

Tutorial time: 19 minutes

| Time | Step |
| --- | --- |
| 1 minute | Installing the Grid CLI |
| 2 minutes | Preparing the dataset |
| 2 minutes | Creating a Datastore |
| 2 minutes | Starting a Session |
| 1 minute | ssh connecting to the Session |
| 5 minutes | Prototyping the model |
| 1 minute | Pausing the Session |
| 1 minute | Run (hyperparameter sweep) |
| 3 minutes | Bonus: Become a power user |

Terminology Glossary

| Term | Description |
| --- | --- |
| CIFAR-5 | A dataset with 5 classes (airplane, automobile, ship, truck, bird). |
| grid datastore | A high-performance, low-latency, auto-versioned dataset. |
| grid run | Runs a model (or many models) on a cloud machine (hyperparameter sweep). |
| grid session | A live machine with 1 or more GPUs for developing models. |
| experiment | A single model with a given configuration. |
| run | A collection of experiments. |
| ssh | A way to connect from a local machine to a remote machine. |

The dataset

For this tutorial we'll be using CIFAR-5. This is a subset of CIFAR-10 that we've chosen to make the models train faster.

CIFAR-5 (modified from CIFAR-10)

The goal is to teach a small neural network to classify images from these 5 classes.

Step 0: Install the Grid CLI

(Time: 1 minute)

It's recommended to use a virtual environment when working with Grid. You can use conda or venv (if you're on macOS, we recommend venv).
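For example, a minimal venv setup (the environment name `grid-env` is arbitrary):

```bash
# create and activate a fresh virtual environment
python3 -m venv ~/.venvs/grid-env
source ~/.venvs/grid-env/bin/activate
```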

```bash
pip install lightning-grid --upgrade
```
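To sanity-check the install, print the CLI help (any subcommand listing confirms `grid` is on your PATH):

```bash
grid --help
```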

Now log in:

```bash
grid login
```

This will open the browser to your settings. If you signed up to Grid with Google, your username is your email address. If you used GitHub, your username is your GitHub username.

If your machine doesn't support browsers, use this instead (get your username and key here):

```bash
grid login --username YOUR_USERNAME --key YOUR_API_KEY
```

You'll only have to do this once!

Step 1: Prepare the dataset

(Time: 2 minutes)

In a real workflow, you would already have the data locally or on a cluster. To make sure we are all using the same data, download the dataset to your machine and unzip it.

```bash
# download (curl's -O flag saves the file under its remote name, cifar5.zip)
curl -O https://pl-flash-data.s3.amazonaws.com/cifar5.zip
# unzip
unzip cifar5.zip
```

This should create a folder with this structure:
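The listing was not reproduced here, but based on the 5 classes used in this tutorial it should look roughly like this (one subfolder per class, with separate train and test splits; treat this layout as an assumption rather than a guarantee):

```
cifar5/
├── train/
│   ├── airplane/
│   ├── automobile/
│   ├── bird/
│   ├── ship/
│   └── truck/
└── test/
    └── ... (the same 5 class folders)
```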

Now that we all have the same data, let's start the real tutorial!

Hint: the web UI can create a datastore directly from a .zip; the manual download here is just for tutorial purposes.

Step 2: Create a datastore

(Time: 2 minutes)

In a realistic workflow, we would start here. The first thing you want to do is create a DATASTORE on Grid with your dataset. The datastore optimizes your data for low latency and high availability from any machine you run on Grid.

Now create the datastore, which uploads your dataset and optimizes it:

```bash
grid datastore create --source cifar5/ --name cifar5
```

Make sure it was created:

```bash
grid datastore list
```

Note: The datastore moves through a series of statuses while it is being optimized. When it reaches "Succeeded", it's ready to be used.
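Optimization can take a while for larger datasets. Plain `watch` (nothing Grid-specific) saves you from re-running the command by hand:

```bash
# re-check the datastore status every 30 seconds until it reads "Succeeded"
watch -n 30 grid datastore list
```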

Periodic uploads

In certain cases your data might change every few hours. If so, you can add the datastore create command to your crontab; Grid will automatically version the datastore for you.

```bash
# write out current crontab
crontab -l > mycron
# run the datastore upload every hour, every day
echo "0 * * * * grid datastore create --source cifar5/ --name cifar5" >> mycron
# install new cron file
crontab mycron
rm mycron
```

Step 3: Create ssh keys (optional)

(Time: 1 minute)

This is optional, but it enables you to:

  • ssh in from your local machine

  • use ssh + VSCode

Create the ssh keys and add them to Grid

You need to do this step only once

```bash
# make the ssh key (if you don't have one)
ssh-keygen -b 2048 -t rsa -f ~/.ssh/grid_ssh_creds -q -N ""
# add the PUBLIC key to Grid (never upload the private key)
grid ssh-keys add key_1 ~/.ssh/grid_ssh_creds.pub
```
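To double-check that the key was registered, you can list your keys (this assumes your CLI version ships the `list` subcommand alongside `add`):

```bash
grid ssh-keys list
```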

Step 4: Start a Session

(Time: 3 minutes)

Now that your data has been uploaded, the next step in a real workflow is to spend time doing any of the following:

  • Debugging the model

  • Prototyping it on multiple GPUs

  • Adjusting the batch size to maximize GPU usage

  • Using the model for analysis, which might require GPUs

  • Exploring and visualizing the model

This is exactly what Sessions were created for.

Start a Session named **resnet-debugging** with 2 M60 GPUs and attach our CIFAR-5 datastore (on AWS, the g3.8xlarge instance type comes with 2 M60 GPUs).

Note: A credit card needs to be added to use GPU machines

```bash
grid session create \
  --instance_type g3.8xlarge \
  --name resnet-debugging \
  --datastore_name cifar5 \
  --datastore_version 1
```

See if it's ready:

```bash
grid status
```

Step 5: Connect to the Session

(Time: 1 minute)

Once the Session is ready, there are a few ways to interact with it (this tutorial covers ssh here and VSCode in the next step). Let's log in to the Session via SSH:

```bash
grid session ssh resnet-debugging
```

Now you're on the cloud machine! See how many GPUs you have:

```bash
nvidia-smi
```

List the datastore:

```bash
ls ~/datastore
```

Now you can code away!

```bash
git clone https://github.com/williamFalcon/cifar5
# debug, prototype, etc...
# push changes when done
git commit -am "..."
git push
```

Step 6: Develop the model

(Time: 5 minutes)

Now that you have your data, code, and 2 GPUs, we get to the fun part! Let's develop the model.

At the end of the last section you used ssh to make model changes. However, I actually prefer to use VSCode for this. Let's set up VSCode to code directly on the remote machine.

First, launch VSCode and install the Remote Development extension.

ssh into the Session:

```bash
grid session ssh resnet-debugging
```

Now link up VSCode with the Session:

```bash
grid session ssh vscode
```

The model

For this tutorial, I'm going to use a non-trivial project structure that is representative of realistic use cases [code link].

The project has this structure
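The structure wasn't reproduced here, but the pieces this tutorial actually touches can be reconstructed from the commands used below (`pip install -e .` implies a setup.py, and the training step calls project/lit_image_classifier.py). The real repo contains more than this minimal skeleton:

```
cifar5/
├── project/
│   ├── __init__.py
│   └── lit_image_classifier.py
├── requirements.txt
└── setup.py
```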

This folder is complicated on purpose, to showcase that Grid is designed for realistic deep learning workloads. I'm purposely avoiding the kind of trivial single-file project (code reference) that Grid handles easily.

For best practices on structuring machine learning projects in general, stay tuned for a best-practices guide.

Clone the project on the Session:

```bash
git clone https://github.com/williamFalcon/cifar5
```

Install the requirements + the project:

```bash
cd cifar5
sudo pip install -r requirements.txt
pip install -e .
```
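As a quick, optional sanity check that the editable install worked (assuming the package is importable as `project`, per the layout above):

```bash
python -c "import project; print(project.__file__)"
```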

Now run the following command to train a resnet50 on 2 GPUs:

```bash
python project/lit_image_classifier.py \
  --data_dir ~/datastore \
  --gpus 2 \
  --accelerator 'ddp' \
  --backbone resnet50
```

You should see results like this (the script is designed to overfit the validation split, which is why test_acc is a perfect 1.0):

```
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 1.0, 'test_loss': 1.2107692956924438}
--------------------------------------------------------------------------------
```

At this step (in a real workflow) you would code the model, debug, etc... using the remote GPUs from your local VSCode :)

Once you're ready, commit your changes so we can train at scale:

```bash
git commit -am "changes"
git push
```

Step 7: Pause the Session

(Time: 1 minute)

Great! Now that our model is ready to run at scale, we can pause the Session.

```bash
grid session pause resnet-debugging
```

If you're tired of rebuilding environments every time you want to do a little bit of work, pausing is your saving grace. Pausing saves:

  • Your files

  • Your data

  • Your environment (installed packages, etc.)

In addition, a paused Session STOPS THE COST OF THE SESSION!

Step 8: RUN (hyperparam sweep)

(Time: 1 minute)

Once your model is ready to go, you usually want to train it to convergence. If you already know a good set of hyperparameters then your run will be very simple since it will train a single model.

If you'd like to find better hyperparameters for your model, a RUN can launch multiple variations of your model to try all hyperparameters at once.

First, always commit your changes and push to GitHub; Grid runs the latest version of your code from whatever branch you have checked out locally.

```bash
git commit -am "ready to run"
git push
```

Now let's kick off a RUN.

First, make sure we're all in the same folder for the tutorial:

```bash
cd cifar5/project
ls
# __init__.py lit_image_classifier.py
```

Now kick off the run with grid run:

```bash
grid run \
  --datastore_name cifar5 \
  --datastore_version 1 \
  --datastore_mount_dir /cifar5 \
  --instance_type 2_m60_8gb \
  --framework lightning \
  --gpus 2 \
  lit_image_classifier.py \
  --backbone "['resnet50', 'resnet34', 'resnet18']" \
  --learning_rate "uniform(1e-5, 1e-1, 5)" \
  --data_dir /cifar5 \
  --gpus 2
```
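A note on what this expands to: the quoted list supplies 3 values for --backbone and uniform(1e-5, 1e-1, 5) samples 5 values for --learning_rate, so with a grid-search expansion over the cross product this single command launches 3 × 5 = 15 experiments, one per hyperparameter combination.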

You can do this from the Session or from your local machine (if you run it locally, you'll need to clone the project first).

Bonus: Use a YAML for common runs

When your runs get repetitive, or when they have a lot of hyperparameters, use a YAML file to save the run configuration.
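As a sketch of the idea (assuming the `--config` flag described in the YML documentation; the file name `sweep.yml` is arbitrary):

```bash
# re-launch a saved sweep configuration instead of retyping every flag
grid run --config sweep.yml
```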

Check out the YML documentation