PyTorchJob Automated Tuning Scheduler

ZenithTune provides a scheduler functionality (PyTorchJobTuningScheduler) that automatically discovers PyTorchJobs and performs hyperparameter tuning in Kubernetes environments. This feature automates job tuning on clusters using only YAML annotation configuration, without writing code.

How to Use PyTorchJobTuningScheduler

PyTorchJobTuningScheduler monitors the Kubernetes cluster for newly created PyTorchJobs and automatically starts hyperparameter tuning for any job that meets specific conditions (annotations).

Annotation-Based Tuning

The key feature of PyTorchJobTuningScheduler is that hyperparameter tuning starts automatically just by adding optimization configuration to the PyTorchJob YAML annotations.

1. Starting the Scheduler

Start PyTorchJobTuningScheduler as follows:

from zenith_tune.integration.kubernetes import JobFilter, PyTorchJobTuningScheduler

# Only target jobs with the zenith-tune/optimization-config annotation
job_filter = JobFilter(
    annotations={"zenith-tune/optimization-config": None}  # Key existence check
)

scheduler = PyTorchJobTuningScheduler(
    submit_namespace="default",
    job_filter=job_filter,
    max_concurrent_tuning=1,
)
scheduler.run()

2. Preparing Jobs for Tuning

Once the scheduler is running, add the zenith-tune/optimization-config annotation to the PyTorchJob you want to tune:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: my-training-
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: "learning_rate"
          type: "float"
          range: [0.001, 0.1]
          log: true
          target_env: "LEARNING_RATE"
        - name: "batch_size"
          type: "int"
          range: [16, 128]
          step: 16
          target_env: "BATCH_SIZE"
      objective:
        name: "loss"
        regex: "Final Loss: ([0-9]+\\.?[0-9]*)"
        direction: "minimize"
      n_trials: 10
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command:
                - python
                - train.py
                - --learning-rate
                - ${LEARNING_RATE:-0.01}
                - --batch-size
                - ${BATCH_SIZE:-32}

Important: The environment variables named in target_env (e.g., LEARNING_RATE, BATCH_SIZE) are automatically set to the optimized values for each trial during tuning. In the example above, the command references them as ${LEARNING_RATE:-0.01}, which also supplies a default value (0.01) for runs where the variable is not set.
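As an alternative to the shell-style ${VAR:-default} form on the command line, the training script can read the variables directly. A minimal sketch (the variable names come from the example above; the defaults are illustrative):

import os

# Tuned values are injected per trial via the target_env environment variables;
# the fallbacks mirror the defaults used in the YAML example above.
learning_rate = float(os.environ.get("LEARNING_RATE", "0.01"))
batch_size = int(os.environ.get("BATCH_SIZE", "32"))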

3. Starting Automatic Tuning

With the scheduler running, when you submit a PyTorchJob with the zenith-tune/optimization-config annotation to the cluster, tuning starts automatically.

The scheduler detects newly created jobs and automatically performs hyperparameter tuning based on the annotations.
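For example, assuming the manifest above is saved as my-training.yaml (the filename is illustrative), submit it with kubectl create; kubectl apply cannot be used here because the manifest specifies generateName rather than a fixed name:

kubectl create -f my-training.yaml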

Advanced Configuration

Job Selection with JobFilter

You can use JobFilter to narrow down tuning targets with more detailed conditions.

# Target only jobs in specific namespaces
job_filter = JobFilter(
    namespace_pattern=r"^ml-.*",  # Namespaces starting with "ml-"
    annotations={"zenith-tune/optimization-config": None}
)

# Target only jobs with specific labels
job_filter = JobFilter(
    labels={"team": "ai-research"},
    annotations={"zenith-tune/optimization-config": None}
)

# Combination of multiple conditions
job_filter = JobFilter(
    namespace_pattern=r"^production-.*",
    labels={"environment": "production"},
    annotations={"zenith-tune/optimization-config": None}
)

Adjusting Concurrent Execution Count

Use the max_concurrent_tuning parameter to control how many tuning runs execute concurrently, in line with available cluster resources:

scheduler = PyTorchJobTuningScheduler(
    submit_namespace="default",
    job_filter=job_filter,
    max_concurrent_tuning=5,  # Run up to 5 tuning jobs simultaneously
)

Detailed Annotation Configuration Specification

Here's the precise specification for the YAML format of the zenith-tune/optimization-config annotation.

Overall Structure

metadata:
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: <string>
          type: <string>
          target_env: <string>
          # type-specific fields
      objective:
        name: <string>
        regex: <string>
        direction: <string>
      n_trials: <integer>  # Optional

variables (Tuning Parameters)

Available fields for each variable:

| Field      | Type   | Required | Description                                                                                              |
| ---------- | ------ | -------- | -------------------------------------------------------------------------------------------------------- |
| name       | string | Yes      | Unique identifier for the variable used in Optuna trials                                                  |
| type       | string | Yes      | Variable type: float, int, or categorical                                                                 |
| target_env | string | No       | Environment variable name to set in Worker containers (uses uppercase version of name if not specified)   |
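For example, a variable that omits target_env is exposed under the uppercase version of its name; the dropout variable below is illustrative:

- name: "dropout"
  type: "float"
  range: [0.0, 0.5]
  # target_env omitted: the environment variable name defaults to DROPOUT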

Type-Specific Fields

For Float type:

| Field | Type           | Required | Description                             |
| ----- | -------------- | -------- | --------------------------------------- |
| range | [float, float] | Yes      | Minimum and maximum values [low, high]  |
| log   | boolean        | No       | Use logarithmic scale (default: false)  |

Example:

- name: "learning_rate"
  type: "float"
  range: [0.0001, 0.1]
  log: true
  target_env: "LEARNING_RATE"

For Integer type:

| Field | Type       | Required | Description                             |
| ----- | ---------- | -------- | --------------------------------------- |
| range | [int, int] | Yes      | Minimum and maximum values [low, high]  |
| step  | int        | No       | Step size (default: 1)                  |

Example:

- name: "batch_size"
  type: "int"
  range: [16, 128]
  step: 16
  target_env: "BATCH_SIZE"

For Categorical type:

| Field   | Type | Required | Description              |
| ------- | ---- | -------- | ------------------------ |
| choices | list | Yes      | List of possible values  |

Example:

- name: "optimizer"
  type: "categorical"
  choices: ["adam", "sgd", "adamw"]
  target_env: "OPTIMIZER_TYPE"

objective (Optimization Target)

Fields used in the objective section:

| Field     | Type   | Required | Description                                   |
| --------- | ------ | -------- | --------------------------------------------- |
| name      | string | Yes      | Human-readable metric name                    |
| regex     | string | Yes      | Regular expression pattern to extract values  |
| direction | string | Yes      | Optimization direction: minimize or maximize  |

Regular Expression Pattern Rules

  • Requires one capture group ( ) that captures a numeric value
  • The capture group must match floating-point numbers: ([0-9]+\.?[0-9]*)
  • Special characters should be properly escaped

Example:

objective:
  name: "validation_loss"
  regex: "Validation Loss: ([0-9]+\\.?[0-9]*)"
  direction: "minimize"
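Note that the doubled backslash in the annotation is YAML double-quote escaping; the pattern ZenithTune receives contains a single backslash. To sanity-check a pattern against a sample log line before submitting a job, plain Python is enough (the log line below is illustrative):

import re

# The pattern as it looks after YAML unescaping
pattern = r"Validation Loss: ([0-9]+\.?[0-9]*)"

line = "epoch 10 | Validation Loss: 0.0421"
match = re.search(pattern, line)
if match:
    print(float(match.group(1)))  # -> 0.0421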

How to Stop Tuning

To stop tuning in progress, delete the original PyTorchJob:

kubectl delete pytorchjob my-training-xxxxx

When the original job is deleted, the tuning process for that job is automatically interrupted and no new trial jobs are generated. Trial jobs that are already running will run to completion, but no new trials will start.
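Because the manifest uses generateName, the actual job name carries a generated suffix; if you do not know it, list the jobs first (the namespace is an assumption):

kubectl get pytorchjobs -n default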

Notes and Best Practices

  • Jobs that existed before the scheduler started are automatically excluded
  • Jobs generated by tuning (marked with the zenith-tune/created-by: PyTorchJobTuner annotation) are never selected as tuning targets themselves
  • Set max_concurrent_tuning appropriately for your cluster's resource capacity
  • If the annotation's YAML is malformed, the job is skipped