PyTorchJobTuningScheduler Automated Tuning Examples
This document walks through examples of automated hyperparameter tuning on a Kubernetes cluster using PyTorchJobTuningScheduler.
Here we demonstrate how to tune the learning rate, batch size, and optimizer for an MNIST training script included in intelligence/components/zenith-tune/examples/integration/kubernetes, using only annotation configuration.
Prerequisites
Before running the automated tuning example, prepare the following:
- A Kubernetes cluster environment where PyTorchJobs can be submitted and retrieved
- kubectl installed
- A Python environment with ZenithTune installed
If you don't have an accessible Kubernetes cluster, you can run the automated tuning example by building a cluster in your local environment using minikube.
# Start Minikube cluster and set context
minikube start
kubectl config use-context minikube
# Install Kubeflow Training Operator
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"
Files Used for Automated Tuning
This section explains the files used for automated tuning, found in FAIB's intelligence/components/zenith-tune/examples/integration/kubernetes.
scheduler_example.py
This Python script launches PyTorchJobTuningScheduler to execute annotation-based automated tuning. The scheduler is configured to target only jobs that have the zenith-tune/optimization-config annotation:
from zenith_tune.integration.kubernetes import JobFilter, PyTorchJobTuningScheduler

# Only target jobs with the zenith-tune/optimization-config annotation
job_filter = JobFilter(
    annotations={"zenith-tune/optimization-config": None}  # Key existence check
)

scheduler = PyTorchJobTuningScheduler(
    submit_namespace="default",
    job_filter=job_filter,
    max_concurrent_tuning=1,
)
scheduler.run()
annotation_based_tuning.yaml
This YAML file defines a PyTorchJob that carries the annotation-based tuning configuration. The training script is defined in a ConfigMap and is designed to receive hyperparameters as command-line arguments.
Key features of the training script (a simplified sketch follows below):
- Receives the learning rate, batch size, and optimizer settings via command-line arguments
- Prints the final loss value, which the tuner uses as the objective function
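For reference, the sketch below outlines the kind of command-line interface and final-loss output the training script provides. It is a hypothetical, simplified outline rather than the actual train_mnist_tunable.py stored in the ConfigMap; only the argument names and the "Final Loss:" output format are taken from this example, and the real training loop is omitted.
import argparse

# Hypothetical sketch of the tunable training script's interface; the real
# train_mnist_tunable.py in the ConfigMap runs an actual MNIST training loop.
def main():
    parser = argparse.ArgumentParser(description="Tunable MNIST training (sketch)")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--optimizer", choices=["sgd", "adam", "rmsprop"], default="sgd")
    parser.add_argument("--epochs", type=int, default=2)
    args = parser.parse_args()

    # ... build the model, data loaders, and optimizer from args, then train ...
    final_loss = 0.1234  # placeholder for the loss measured after the last epoch

    # This line is what the objective regex in the annotation matches:
    #   Final Loss: ([0-9]+\.?[0-9]*)
    print(f"Final Loss: {final_loss}")

if __name__ == "__main__":
    main()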
PyTorchJob annotation configuration:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: annotation-based-tuning-
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: "learning_rate"
          type: "float"
          range: [0.001, 0.1]
          log: true
          target_env: "LEARNING_RATE"
        - name: "batch_size"
          type: "int"
          range: [16, 128]
          step: 16
          target_env: "BATCH_SIZE"
        - name: "optimizer"
          type: "categorical"
          choices: ["sgd", "adam", "rmsprop"]
          target_env: "OPTIMIZER"
      objective:
        name: "loss"
        regex: "Final Loss: ([0-9]+\\.?[0-9]*)"
        direction: "minimize"
      n_trials: 5
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
              command:
                - sh
                - -c
                - |
                  python /scripts/train_mnist_tunable.py \
                    --learning-rate ${LEARNING_RATE:-0.01} \
                    --batch-size ${BATCH_SIZE:-32} \
                    --optimizer ${OPTIMIZER:-sgd} \
                    --epochs 2
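Reading the configuration together with the container command suggests the flow: each trial's sampled values are injected through the environment variables named by target_env (LEARNING_RATE, BATCH_SIZE, OPTIMIZER), which the shell command forwards as script arguments, and the objective is recovered by matching the regex against the job's log output. The snippet below is a standalone illustration of that regex extraction, not ZenithTune's internal code:
import re

# Standalone illustration of how the objective regex from the annotation
# would pull the loss value out of a trial's log output.
OBJECTIVE_REGEX = re.compile(r"Final Loss: ([0-9]+\.?[0-9]*)")

sample_log_line = "Final Loss: 0.0423"  # as printed by the training script
match = OBJECTIVE_REGEX.search(sample_log_line)
if match:
    loss = float(match.group(1))
    print(f"Extracted objective (direction: minimize): {loss}")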
Running Automated Tuning
1. Starting the Scheduler
# Run the scheduler
python scheduler_example.py
2. Submitting the PyTorchJob
With the scheduler running, submit the PyTorchJob you want to tune to the cluster:
# Submit ConfigMap and PyTorchJob to the cluster
kubectl create -f annotation_based_tuning.yaml
3. Checking Tuning Status
# Check tuning jobs
kubectl get pytorchjobs
The tuning has completed successfully when five tuning jobs (one per trial, as specified by n_trials: 5) have run and the best hyperparameters have been identified.