
PyTorchJobTuningScheduler Automated Tuning Examples

This document walks through an example of automated hyperparameter tuning on a Kubernetes cluster using PyTorchJobTuningScheduler. It demonstrates how to tune the learning rate, batch size, and optimizer of an MNIST training script, included in intelligence/components/zenith-tune/examples/integration/kubernetes, using only annotation configuration.

Prerequisites

Before running the automated tuning example, prepare the following:

  1. A Kubernetes cluster environment where PyTorchJobs can be submitted and retrieved
  2. kubectl installed
  3. A Python environment with ZenithTune installed

If you don't have an accessible Kubernetes cluster, you can run the automated tuning example by building a cluster in your local environment using minikube.

# Start Minikube cluster and set context
minikube start
kubectl config use-context minikube

# Install Kubeflow Training Operator
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

Files Used for Automated Tuning

This section explains the files used for automated tuning, which are contained in FAIB's intelligence/components/zenith-tune/examples/integration/kubernetes.

scheduler_example.py

This Python script launches PyTorchJobTuningScheduler to execute annotation-based automated tuning.

The scheduler is configured to target only jobs with the zenith-tune/optimization-config annotation:

from zenith_tune.integration.kubernetes import JobFilter, PyTorchJobTuningScheduler

# Only target jobs with zenith-tune/optimization-config annotation
job_filter = JobFilter(
    annotations={"zenith-tune/optimization-config": None}  # Key existence check
)

scheduler = PyTorchJobTuningScheduler(
    submit_namespace="default",
    job_filter=job_filter,
    max_concurrent_tuning=1,
)
scheduler.run()

annotation_based_tuning.yaml

This YAML file defines a PyTorchJob containing annotation-based tuning configuration. The training script is defined in a ConfigMap and is designed to receive hyperparameters from command-line arguments.

Key features of the training script:

  • Receives learning rate, batch size, and optimizer settings via command-line arguments
  • Outputs the final loss value, which serves as the tuning objective (see the sketch below)
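
For orientation, here is a minimal sketch of what such a tunable training script could look like. This is not the script shipped in the ConfigMap; it is a simplified illustration (with a mock training loop instead of real MNIST training) that accepts the same command-line arguments and prints a final loss line matching the objective regex used in the annotation below.

# Hypothetical sketch of a tunable training script (not the actual ConfigMap script).
import argparse
import random


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--optimizer", choices=["sgd", "adam", "rmsprop"], default="sgd")
    parser.add_argument("--epochs", type=int, default=2)
    args = parser.parse_args()

    print(f"lr={args.learning_rate} batch_size={args.batch_size} optimizer={args.optimizer}")

    # Placeholder for the real MNIST training loop: fake a decreasing loss so the
    # sketch stays self-contained and runnable without a dataset or GPU.
    loss = 2.3
    for epoch in range(args.epochs):
        loss *= 0.4 + 0.2 * random.random()
        print(f"epoch={epoch} loss={loss:.4f}")

    # This line is what the objective regex "Final Loss: ([0-9]+\.?[0-9]*)" captures.
    print(f"Final Loss: {loss:.4f}")


if __name__ == "__main__":
    main()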

PyTorchJob annotation configuration:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: annotation-based-tuning-
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: "learning_rate"
          type: "float"
          range: [0.001, 0.1]
          log: true
          target_env: "LEARNING_RATE"
        - name: "batch_size"
          type: "int"
          range: [16, 128]
          step: 16
          target_env: "BATCH_SIZE"
        - name: "optimizer"
          type: "categorical"
          choices: ["sgd", "adam", "rmsprop"]
          target_env: "OPTIMIZER"
      objective:
        name: "loss"
        regex: "Final Loss: ([0-9]+\\.?[0-9]*)"
        direction: "minimize"
      n_trials: 5
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
              command:
                - sh
                - -c
                - |
                  python /scripts/train_mnist_tunable.py \
                    --learning-rate ${LEARNING_RATE:-0.01} \
                    --batch-size ${BATCH_SIZE:-32} \
                    --optimizer ${OPTIMIZER:-sgd} \
                    --epochs 2
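
The objective value is extracted from the training job's log output using the regex defined in the annotation. If you want to sanity-check the pattern against a sample log line before submitting the job, a short standalone snippet like the following will do; it uses only the Python standard library and does not touch any ZenithTune or Kubernetes API.

import re

# Same pattern as in the zenith-tune/optimization-config annotation.
pattern = re.compile(r"Final Loss: ([0-9]+\.?[0-9]*)")

sample_log_line = "Final Loss: 0.1234"
match = pattern.search(sample_log_line)
if match:
    print(float(match.group(1)))  # -> 0.1234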

Running Automated Tuning

1. Starting the Scheduler

# Run the scheduler
python scheduler_example.py

2. Submitting the PyTorchJob

With the scheduler running, submit the PyTorchJob you want to tune to the cluster:

# Submit ConfigMap and PyTorchJob to the cluster
kubectl create -f annotation_based_tuning.yaml

3. Checking Tuning Status

# Check tuning jobs
kubectl get pytorchjobs

Tuning has completed successfully once five tuning jobs (one per trial, matching n_trials) have been executed and the best hyperparameters have been identified.
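
If you prefer to inspect the tuning jobs programmatically instead of via kubectl, one option is the official kubernetes Python client (installed with pip install kubernetes). The snippet below is an optional, illustrative alternative and is not part of the ZenithTune example files.

from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. the minikube context).
config.load_kube_config()

# PyTorchJobs are custom resources, so query them via the CustomObjects API.
api = client.CustomObjectsApi()
jobs = api.list_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
)

for job in jobs.get("items", []):
    name = job["metadata"]["name"]
    conditions = job.get("status", {}).get("conditions", [])
    # The most recent condition (e.g. Created, Running, Succeeded) reflects the current state.
    state = conditions[-1]["type"] if conditions else "Unknown"
    print(f"{name}: {state}")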