Training Deep Learning Models with PyTorch and Wisp
Model training often requires access to large-scale compute and specialized GPUs. Wisp is ideal for this use case, as the platform enables cloud-agnostic:
- Resource discovery and optimization, enabling better decisions on instance type and cloud provider.
- VM provisioning and setup, removing the hurdles of setting up access and environments.
- Direct access to the VM, for debugging without having to rerun the entire pipeline.
- A simple job runner that keeps track of what's running and its status, with autostop and email notifications on job completion.
Let's set up a sample training pipeline to see how Wisp supports the workflow!
First, make sure you have followed the Quickstart and installed the CLI on your local machine. Once done, you should be able to execute wisp auth in your terminal and log in with your browser.
Overview
In this example, we will train an Ultralytics/yolov5 model on a remote server managed by Wisp. We will follow a local-first workflow, where most development is done locally and the workspace is transferred to the remote only for job execution. This limits the cost of keeping the server running and makes it easier to debug project-specific errors.
Set up local environment
Start by downloading the project to a location on your local machine:
git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -r requirements.txt
You can do local development: adapt the code and make any changes you'd like. When we move to the next step, the complete workspace will be transferred to the remote server for training. For this example, we will train the model with COCO128 (the first 128 images from COCO).
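Before moving anything to a remote machine, it can be worth a quick local smoke test. The command below is a minimal sketch using standard yolov5 train.py flags (--img, --epochs, --data, --batch-size); the small values are only meant to confirm the pipeline runs end to end, not to produce useful weights:
python train.py --img 640 --epochs 1 --data coco128.yaml --batch-size 8   # short sanity-check run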
Set up Wisp
Wisp uses a wisp-spec.yml file to keep track of your project, dependencies, and configuration. To create a blank file, cd yolov5 and execute wisp init. This will create a wisp-spec.yml file in your project directory. For now, it is unconstrained, so we need to add a GPU in order for the training to work.
Open the wisp-spec.yml file with an editor of your choice. We need to update the following fields (you can see the final specification below):
- setup: add a script step with the three commands from above (git clone ..., cd ..., and pip install ...).
- run: add a script step with python train.py --data coco128.yaml.
- resources.accelerators: add "T4" as the GPU.
You can always modify the spec file with new constraints or script steps.
The final file should look like this:
project:
  name: { MY-PROJECT-NAME }
  projectid: { MY-PROJECT-ID }
  type: local
setup:
  script:
    - git clone https://github.com/ultralytics/yolov5
    - cd yolov5
    - pip install -r requirements.txt
run:
  script:
    - python train.py --data coco128.yaml
teardown: null
resources:
  clouds: []
  regions: []
  areas: []
  accelerators:
    name: [T4]
    compute_capability: ""
    vram: 0
    n_accelerators: 0
  memory: 2
  vcpus: 2
  storage: 256
  persistent_disk: 256
job:
  autostop:
    enabled: False
    timeout_minutes: 10
  notification:
    email_on_success: True
    email_on_failure: True
    email_recipient: { MY-EMAIL }
io:
  inputs: null
  outputs: null
That's it! From this specification, Wisp will be able to start a VM matching your requirements, copy your workspace to it, and submit the job. Let's try it!
Run the Job
Execute wisp run in the project directory. You will be presented with a number of machine options matching your configuration. Pick the one that fits your needs and accept. Note that you will be charged while the instance is running.
You can track the provisioning progress in the Dashboard. Provisioning may take 1-5 minutes.
Once the VM has started, the job will be submitted to it. Your job will get an ID, which is printed to the terminal. The job runs in detached mode, which means you can exit the terminal or shut down your computer without interrupting it. Once the job is done, you will receive an email notification.
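As a sketch, the detached workflow with the commands described in this guide looks like this:
wisp run        # pick a machine; the job ID is printed to the terminal
# it is now safe to close the terminal; the job keeps running remotely
wisp job list   # later, from any terminal: list your jobs and their status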
Attaching to your job
If you have detached from the job (exited the terminal), you can reattach to it using its job ID. If you lost the job ID, you can find it in the Dashboard or with wisp job list. To attach to the job, execute wisp job attach { JOB_ID }.
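For example, a reattach might look like this (the job ID shown is hypothetical):
wisp job list            # look up the job ID
wisp job attach 1a2b3c   # reattach to the job with ID 1a2b3c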
Debugging
In case the job fails on the remote server, you can always SSH into the VM and debug live. With wisp ssh, Wisp will automatically connect to the currently active VM using the generated credentials.
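For example, a debugging session might confirm the GPU is visible and inspect yolov5's per-epoch training metrics. Note that the workspace path below is an assumption; it depends on where Wisp places your files on the VM:
wisp ssh                                # connect to the active VM
nvidia-smi                              # on the VM: check that the T4 is visible and utilized
cat yolov5/runs/train/exp/results.csv   # yolov5's per-epoch metrics; assumed workspace location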
Extracting Weights
Once the job is completed, we can extract the computed checkpoints to the local machine. Normally, we would set up a bucket for storing the weights in the cloud, but to keep things simple, we'll download the weights to the local machine with scp:
scp user@ip:/path/to/workspace/yolov5/runs/train/exp/weights/best.pt ./
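To sanity-check the downloaded checkpoint, you can run yolov5's bundled detect.py from your local yolov5 directory against one of the sample images that ship with the repository:
python detect.py --weights best.pt --source data/images/bus.jpg   # run inference with the trained weights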