
Example - PyTorch Job

What is it about?

Distributed training with PyTorch. You can select the level of parallelism, the number of workers, and the resources (GPU, CPU) for each worker.
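For context, a minimal sketch of where these knobs live, assuming the provided .yaml files use the Kubeflow training operator's PyTorchJob API (kubeflow.org/v1); the exact apiVersion and field names in this repo may differ slightly:

```yaml
# Sketch of a PyTorchJob Worker section (assumed kubeflow.org/v1 API, not
# copied from the repo): "replicas" sets the number of workers, and
# "resources" sets the GPU/CPU budget for each one.
pytorchReplicaSpecs:
  Worker:
    replicas: 2                # number of workers (level of parallelism)
    restartPolicy: OnFailure
    template:
      spec:
        containers:
          - name: pytorch      # default container name the PyTorch operator expects
            image: registry.hub.docker.com/@username/pytorch-job
            resources:
              limits:
                nvidia.com/gpu: 1   # GPUs per worker
                cpu: "4"            # CPUs per worker
```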

How to run?

  • Navigate to the example directory: cd pytorchjob
  • Build the Docker image with your custom code:
    • docker build -f Dockerfile -t @username/pytorch-job . (the trailing . is the build context)
  • Push the Docker image: docker push @username/pytorch-job
  • Edit pytorch.yaml (or one of the other provided .yaml files); see the sketch after this list.
    • Define the PyTorch master and workers.
    • Make sure the containers run the image you have just built: image: registry.hub.docker.com/@username/pytorch-job
  • In a terminal, run: kubectl apply -f pytorch.yaml
  • Monitor the PyTorchJob with: kubectl logs pytorch-dist-mnist-nccl-master-0
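For reference, a minimal sketch of what an edited pytorch.yaml could look like, again assuming the kubeflow.org/v1 PyTorchJob API; the job name is taken from the kubectl logs command above, and the structure of the provided .yaml files may differ:

```yaml
# Minimal PyTorchJob sketch (assumed kubeflow.org/v1 API): one master plus
# the workers, all running the image built and pushed above.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-nccl     # matches the pod names used by kubectl logs
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.hub.docker.com/@username/pytorch-job
    Worker:
      replicas: 2                   # number of workers
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.hub.docker.com/@username/pytorch-job
              # plus per-worker GPU/CPU limits as in the earlier sketch
```

After kubectl apply, the operator creates pods named after the job (for example pytorch-dist-mnist-nccl-master-0 and pytorch-dist-mnist-nccl-worker-0), which is why the log command above targets the master-0 pod.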