Now that we all have the same data, let's start the real tutorial!
Step 1: Create a datastore
(Time: 2 minutes)
In a realistic workflow, we would start here. The first thing you want to do is create a DATASTORE on Grid from your dataset. The datastore optimizes your data for low-latency, high-availability access from any machine you run on Grid.
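In practice, creating a datastore is a single CLI command. The flags below are a sketch from memory, not a definitive reference — verify the exact names with `grid datastore --help` in your Grid version:

```shell
# Sketch: upload a local folder as a Grid datastore.
# Flag names are illustrative; confirm with `grid datastore --help`.
grid datastore create --source ./cifar5 --name cifar5

# List datastores to confirm the upload finished.
grid datastore list
```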
Now that your data has been uploaded, the next step in a real workflow is to spend time doing any of the following:
Developing and debugging the model
Prototyping it on multiple GPUs
Adjusting the batch size to maximize GPU utilization
Using the model for analysis, which might require GPUs
Exploring and visualizing the model
This is exactly what Sessions were created for.
Step 2: Start a Session
Start a Session named resnet-debugging with 2 M60 GPUs and attach the CIFAR-5 datastore.
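Sessions can also be started from the CLI. The command below is only a sketch: the flag names are from memory and the instance type is an assumption (on AWS, a g3.8xlarge carries 2 M60 GPUs) — check `grid session create --help` for the real options:

```shell
# Sketch: start an interactive Session with 2 M60 GPUs and the
# CIFAR-5 datastore attached. Flag names are illustrative only;
# verify with `grid session create --help`.
grid session create \
  --name resnet-debugging \
  --instance_type g3.8xlarge \
  --datastore_name cifar5
```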
Note: A credit card must be added to your account before you can use GPU machines.
Sessions really shine with huge datasets: the automatic mounting feature means you can jump straight into work instead of waiting a long time for your data to download.
TIP: If you prefer to SSH in directly or use VS Code (instead of JupyterLab), the other icons have setup instructions.
Step 3: Develop the model
(Time: 5 minutes)
Now that you have your data, code, and 2 GPUs, we get to the fun part: let's develop the model!
For this tutorial, I'm going to use a non-trivial project structure that is representative of realistic use cases [code link].
The project has this structure:
This folder is complicated on purpose, to showcase that Grid is designed for realistic deep learning workloads. I'm purposely avoiding simple projects (code reference) that look like this, since those are trivial for Grid to handle.
For best practices on structuring machine learning projects in general, read our guide (coming soon).
On the Session you would normally:
tune the batch size
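To make "tune the batch size" concrete, here is a minimal, framework-free sketch of the usual approach: double the batch size until a training step runs out of memory, then keep the last size that worked. Both `find_max_batch_size` and `fake_step` are names I made up for illustration; real code would run an actual forward/backward pass and catch `torch.cuda.OutOfMemoryError` rather than `MemoryError`.

```python
# Sketch: find the largest batch size that fits in memory by doubling
# until a step fails, then returning the last size that succeeded.

def find_max_batch_size(try_batch, start=2, limit=4096):
    """Double the batch size until try_batch() fails; return last success."""
    best = None
    size = start
    while size <= limit:
        try:
            try_batch(size)   # run one training step at this batch size
            best = size
            size *= 2
        except MemoryError:   # real code: catch torch.cuda.OutOfMemoryError
            break
    return best

# Simulated memory budget: pretend anything above 512 samples OOMs.
def fake_step(batch_size):
    if batch_size > 512:
        raise MemoryError

print(find_max_batch_size(fake_step))  # -> 512
```

The doubling strategy is a cheap binary-search upper bound; after it stops you can refine between the last success and first failure if you need a tighter fit.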
We're going to run the model using the following instructions (this GIF illustrates what we are about to do).
Clone the project in the interactive Session:
git clone https://github.com/williamFalcon/cifar5
Install the requirements and the project:
sudo pip install -r requirements.txt
pip install -e .
Now run the following command to train a ResNet-50 on 2 GPUs:
python project/lit_image_classifier.py \
--data_dir ~/datastore \
You should see the results (the script is intentionally designed to overfit the validation split).
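Overfitting a tiny split like this is a standard sanity check: if a training loop cannot drive the loss to near zero on a handful of samples, something in the pipeline is broken. A minimal, hypothetical illustration of the idea with plain gradient descent (no relation to the actual script's internals):

```python
# Sanity check sketch: fit y = 2x with gradient descent on 4 points
# and confirm the loss collapses toward zero. A loop that cannot
# memorize data this small usually has a bug.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0    # single learnable weight
lr = 0.02  # learning rate

for _ in range(500):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(w, 3), loss < 1e-6)  # w converges to ~2.0
```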