
PyTorchJobTuningScheduler Automated Tuning Examples

This document walks through an example of automated hyperparameter tuning on a Kubernetes cluster using PyTorchJobTuningScheduler. It demonstrates how to tune the learning rate, batch size, and optimizer of an MNIST training script, included in intelligence/components/zenith-tune/examples/integration/kubernetes, using only annotation configuration.

Prerequisites

Before running the automated tuning example, prepare the following:

  1. A Kubernetes cluster environment where PyTorchJobs can be submitted and retrieved
  2. kubectl installed
  3. A Python environment with ZenithTune installed

If you don't have an accessible Kubernetes cluster, you can run the automated tuning example by building a cluster in your local environment using minikube.

# Start Minikube cluster and set context
minikube start
kubectl config use-context minikube

# Install Kubeflow Training Operator
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

Files Used for Automated Tuning

This section explains the files used for automated tuning, which are contained in FAIB's intelligence/components/zenith-tune/examples/integration/kubernetes.

scheduler_example.py

This Python script launches PyTorchJobTuningScheduler to execute annotation-based automated tuning.

The scheduler is configured to target only jobs with the zenith-tune/optimization-config annotation:

from zenith_tune.integration.kubernetes import JobFilter, PyTorchJobTuningScheduler

# Only target jobs with zenith-tune/optimization-config annotation
job_filter = JobFilter(
    annotations={"zenith-tune/optimization-config": None}  # Key existence check
)

scheduler = PyTorchJobTuningScheduler(
    submit_namespace="default",
    job_filter=job_filter,
    max_concurrent_tuning=1,
)
scheduler.run()

annotation_based_tuning.yaml

This YAML file defines a PyTorchJob containing annotation-based tuning configuration. The training script is defined in a ConfigMap and is designed to receive hyperparameters from command-line arguments.

Key features of the training script:

  • Receives learning rate, batch size, and optimizer settings via command-line arguments
  • Outputs the final loss value, which serves as the tuning objective (see the sketch below)
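
For orientation, here is a minimal sketch of what such a tunable training script could look like. This is not the script shipped in the ConfigMap; it is a simplified illustration (with a mock training loop instead of real MNIST training) that accepts the same command-line arguments and prints a final loss line matching the objective regex used in the annotation below.

# Hypothetical sketch of a tunable training script (not the actual ConfigMap script).
import argparse
import random


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--optimizer", choices=["sgd", "adam", "rmsprop"], default="sgd")
    parser.add_argument("--epochs", type=int, default=2)
    args = parser.parse_args()

    print(f"lr={args.learning_rate} batch_size={args.batch_size} optimizer={args.optimizer}")

    # Placeholder for the real MNIST training loop: fake a decreasing loss so the
    # sketch stays self-contained and runnable without a dataset or GPU.
    loss = 2.3
    for epoch in range(args.epochs):
        loss *= 0.4 + 0.2 * random.random()
        print(f"epoch={epoch} loss={loss:.4f}")

    # This line is what the objective regex "Final Loss: ([0-9]+\.?[0-9]*)" captures.
    print(f"Final Loss: {loss:.4f}")


if __name__ == "__main__":
    main()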

PyTorchJob annotation configuration:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: annotation-based-tuning-
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: "learning_rate"
          type: "float"
          range: [0.001, 0.1]
          log: true
          target_env: "LEARNING_RATE"
        - name: "batch_size"
          type: "int"
          range: [16, 128]
          step: 16
          target_env: "BATCH_SIZE"
        - name: "optimizer"
          type: "categorical"
          choices: ["sgd", "adam", "rmsprop"]
          target_env: "OPTIMIZER"
      objective:
        name: "loss"
        regex: "Final Loss: ([0-9]+\\.?[0-9]*)"
        direction: "minimize"
      n_trials: 5
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
              command:
                - sh
                - -c
                - |
                  python /scripts/train_mnist_tunable.py \
                    --learning-rate ${LEARNING_RATE:-0.01} \
                    --batch-size ${BATCH_SIZE:-32} \
                    --optimizer ${OPTIMIZER:-sgd} \
                    --epochs 2
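
The objective value is extracted from the training job's log output using the regex defined in the annotation. If you want to sanity-check the pattern against a sample log line before submitting the job, a short standalone snippet like the following will do; it uses only the Python standard library and does not touch any ZenithTune or Kubernetes API.

import re

# Same pattern as in the zenith-tune/optimization-config annotation.
pattern = re.compile(r"Final Loss: ([0-9]+\.?[0-9]*)")

sample_log_line = "Final Loss: 0.1234"
match = pattern.search(sample_log_line)
if match:
    print(float(match.group(1)))  # -> 0.1234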

Running Automated Tuning

1. Starting the Scheduler

# Run the scheduler
python scheduler_example.py

2. Submitting the PyTorchJob

With the scheduler running, submit the PyTorchJob you want to tune to the cluster:

# Submit ConfigMap and PyTorchJob to the cluster
kubectl create -f annotation_based_tuning.yaml

3. Checking Tuning Status

# Check tuning jobs
kubectl get pytorchjobs

Tuning has completed successfully once five tuning jobs (one per trial, matching n_trials) have been executed and the best hyperparameters have been identified.
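
If you prefer to inspect the tuning jobs programmatically instead of via kubectl, one option is the official kubernetes Python client (installed with pip install kubernetes). The snippet below is an optional, illustrative alternative and is not part of the ZenithTune example files.

from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. the minikube context).
config.load_kube_config()

# PyTorchJobs are custom resources, so query them via the CustomObjects API.
api = client.CustomObjectsApi()
jobs = api.list_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
)

for job in jobs.get("items", []):
    name = job["metadata"]["name"]
    conditions = job.get("status", {}).get("conditions", [])
    # The most recent condition (e.g. Created, Running, Succeeded) reflects the current state.
    state = conditions[-1]["type"] if conditions else "Unknown"
    print(f"{name}: {state}")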