Version: v2512

PyTorchJob Automated Tuning Scheduler

ZenithTune provides a scheduler functionality (PyTorchJobTuningScheduler) that automatically discovers PyTorchJobs and performs hyperparameter tuning in Kubernetes environments. This feature automates job tuning on clusters using only YAML annotation configuration, without writing code.

How to Use PyTorchJobTuningScheduler

PyTorchJobTuningScheduler operates with the following workflow:

This scheduler automatically monitors newly created PyTorchJobs on the Kubernetes cluster and automatically starts hyperparameter tuning for jobs that meet specific conditions (annotations).

Annotation-Based Tuning

The key feature of PyTorchJobTuningScheduler is that hyperparameter tuning starts automatically just by adding optimization configuration to the PyTorchJob YAML annotations.

1. Starting the Scheduler

Start PyTorchJobTuningScheduler as follows:

from aibooster.intelligence.zenith_tune.integration.kubernetes import (
    JobFilter, PyTorchJobTuningScheduler, TuningConfig, TuningRule
)

# Define tuning rules
tuning_rules = [
    TuningRule(
        job_filter=JobFilter(),
        tuning_config=TuningConfig(),
    )
]

scheduler = PyTorchJobTuningScheduler(
    tuning_rules=tuning_rules,
    max_concurrent_tuning_per_namespace=1,  # Maximum 1 per namespace
)
scheduler.run()

2. Preparing Jobs for Tuning

Once the scheduler is running, add the zenith-tune/optimization-config annotation to the PyTorchJob you want to tune:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  generateName: my-training-
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: "learning_rate"
          type: "float"
          range: [0.001, 0.1]
          log: true
          target_env: "LEARNING_RATE"
        - name: "batch_size"
          type: "int"
          range: [16, 128]
          step: 16
          target_env: "BATCH_SIZE"
      objective:
        name: "loss"
        regex: "Final Loss: ([0-9]+\\.?[0-9]*)"
        direction: "minimize"
      n_trials: 10
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command:
                - python
                - train.py
                - --learning-rate
                - ${LEARNING_RATE:-0.01}
                - --batch-size
                - ${BATCH_SIZE:-32}

Important: The environment variable names specified in target_env (e.g., LEARNING_RATE, BATCH_SIZE) will be automatically set with optimized values for each trial during tuning execution. In the example above, we reference environment variables like ${LEARNING_RATE:-0.01} and also specify default values (0.01) when values are not set.

3. Starting Automatic Tuning

With the scheduler running, when you submit a PyTorchJob with the zenith-tune/optimization-config annotation to the cluster, tuning starts automatically.

The scheduler detects newly created jobs and automatically performs hyperparameter tuning based on the annotations.

4. How to Stop Tuning

If you want to stop a running tuning process, delete the original PyTorchJob:

kubectl delete pytorchjob my-training-xxxxx

When the original job is deleted, the tuning process for that job is automatically interrupted and no new trial jobs will be created. Already running trial jobs will complete their execution, but no new trials will be started.

Detailed Annotation Configuration Specification

Here's the precise specification for the YAML format of the zenith-tune/optimization-config annotation.

Overall Structure

metadata:
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: <string>
          type: <string>
          target_env: <string>
          # type-specific fields
      objective:
        name: <string>
        regex: <string>
        direction: <string>
      n_trials: <integer>  # Optional

variables (Tuning Parameters)

Available fields for each variable:

Field	Type	Required	Description
`name`	string	Yes	Unique identifier for the variable used in Optuna trials
`type`	string	Yes	Variable type: `float`, `int`, or `categorical`
`target_env`	string	No	Environment variable name to set in Worker containers (uses uppercase version of `name` if not specified)

Type-Specific Fields

For Float type:

Field	Type	Required	Description
`range`	[float, float]	Yes	Minimum and maximum values [low, high]
`log`	boolean	No	Use logarithmic scale (default: false)

- name: "learning_rate"
  type: "float"
  range: [0.0001, 0.1]
  log: true
  target_env: "LEARNING_RATE"

For Integer type:

Field	Type	Required	Description
`range`	[int, int]	Yes	Minimum and maximum values [low, high]
`step`	int	No	Step size (default: 1)

- name: "batch_size"
  type: "int"
  range: [16, 128]
  step: 16
  target_env: "BATCH_SIZE"

For Categorical type:

Field	Type	Required	Description
`choices`	list	Yes	List of possible values

- name: "optimizer"
  type: "categorical"
  choices: ["adam", "sgd", "adamw"]
  target_env: "OPTIMIZER_TYPE"

objective (Optimization Target)

Fields used in the objective section:

Field	Type	Required	Description
`name`	string	Yes	Human-readable metric name
`regex`	string	Yes	Regular expression pattern to extract values
`direction`	string	Yes	Optimization direction: `minimize` or `maximize`
`selector`	string	No	Value selection method when multiple matches occur (default: `last`)

Regular Expression Pattern Rules

Requires one capture group () that captures a numeric value
The capture group must match floating-point numbers: ([0-9]+\.?[0-9]*)
Special characters should be properly escaped

selector Option

Specifies the selection method when the regular expression pattern matches multiple values:

Value	Description
`first`	Use the first matched value
`last`	Use the last matched value (default)
`min`	Use the minimum value among matches
`max`	Use the maximum value among matches
`mean`	Use the mean value of matches
`median`	Use the median value of matches

Example configuration to minimize the final validation loss:

objective:
  name: "validation_loss"
  regex: "Validation Loss: ([0-9]+\\.?[0-9]*)"
  direction: "minimize"
  selector: "last"

custom_patches (Custom Patches)

Use custom_patches to apply arbitrary patches to trial job manifests. This allows you to configure priorityClassName, nodeSelector, and other settings.

metadata:
  annotations:
    zenith-tune/optimization-config: |
      variables:
        - name: "learning_rate"
          type: "float"
          range: [0.001, 0.1]
          target_env: "LEARNING_RATE"
      objective:
        name: "loss"
        regex: "Final Loss: ([0-9]+\\.?[0-9]*)"
        direction: "minimize"
      custom_patches:
        spec:
          pytorchReplicaSpecs:
            Worker:
              template:
                spec:
                  priorityClassName: low-priority

Advanced Configuration

Job Routing with TuningRule

PyTorchJob Tuning Scheduler allows you to apply different tuning configurations based on job type or purpose. For example, you can configure more trials for training jobs and fewer trials for test jobs, or define different optimization target hyperparameters and objective functions per job. Such job-specific routing is achieved using TuningRule.

TuningRule is a rule that combines JobFilter (defining job conditions) and TuningConfig (defining tuning settings). When multiple rules are defined, the first matching rule is used.

tuning_rules = [
    # Training jobs: more trials
    TuningRule(
        job_filter=JobFilter(
            name_pattern=r"training-.*",
            annotations={"zenith-tune/optimization-config": None}
        ),
        tuning_config=TuningConfig(
            n_trials=20,
        )
    ),
    # Test jobs: fewer trials
    TuningRule(
        job_filter=JobFilter(
            name_pattern=r"test-.*",
            annotations={"zenith-tune/optimization-config": None}
        ),
        tuning_config=TuningConfig(
            n_trials=5,
        )
    ),
    # Others: default configuration
    TuningRule(
        job_filter=JobFilter(
            annotations={"zenith-tune/optimization-config": None}
        ),
        tuning_config=TuningConfig()
    ),
]

scheduler = PyTorchJobTuningScheduler(tuning_rules=tuning_rules)
scheduler.run()

JobFilter

JobFilter defines conditions for jobs to be tuning targets:

job_filter = JobFilter(
    namespace_pattern=r"^ml-.*",  # Namespace regex pattern
    name_pattern=r"training-.*",  # Job name regex pattern
    labels={"team": "ai-research"},  # Labels (exact match)
    annotations={"zenith-tune/optimization-config": None}  # Annotations (None = key existence check)
)

TuningConfig

TuningConfig defines tuning execution settings:

tuning_config = TuningConfig(
    n_trials=20,                             # Number of trials (None to read from annotations)
    output_dir="outputs",                    # Log output directory
    maximize=False,                          # Whether to maximize objective value
    wait_resources=True,                     # Enable resource waiting
    submit_namespace="tuning-ns",            # Namespace for trial job submission (None inherits from original)
    default_params={"learning_rate": 0.01},  # Default parameters for first trial
    default_custom_patches=None,             # Custom patches to apply to trial jobs
    sampler=None,                            # Optuna sampler
)

Use default_custom_patches to customize trial job Kubernetes manifests (see custom_patches). For example, to assign trial jobs to a specific LocalQueue when using Kueue:

tuning_config = TuningConfig(
    default_custom_patches={
        "metadata": {
            "labels": {
                "kueue.x-k8s.io/queue-name": "tuning-queue"
            }
        }
    },
)

Note: If custom_patches is set in job annotations, the annotation settings take precedence and are merged.

Adjusting Concurrent Execution Count

Two parameters are available for adjusting concurrent execution count:

max_concurrent_tuning: Maximum number of tuning processes that can run simultaneously across the cluster (None for unlimited)
max_concurrent_tuning_per_namespace: Maximum number of tuning processes that can run simultaneously per namespace (None for unlimited)

scheduler = PyTorchJobTuningScheduler(
    tuning_rules=tuning_rules,
    max_concurrent_tuning=10,             # Maximum 10 across the cluster
    max_concurrent_tuning_per_namespace=2,  # Maximum 2 per namespace
)

Tuning Metadata Environment Variables

The following environment variables are automatically set in trial jobs:

Environment Variable	Description
`ZENITH_TUNE_ENABLED`	Indicates tuning is enabled (value: `1`)
`ZENITH_TUNE_TRIAL_ID`	Trial number (starting from 0)
`ZENITH_TUNE_PARAMS_LABEL`	Parameter label string (e.g., `learning_rate=0.01,batch_size=32`)

These environment variables can be referenced from within trial jobs for logging or monitoring purposes.

YAML Configuration File

Scheduler configuration can also be written in YAML files.

from aibooster.intelligence.zenith_tune.integration.kubernetes import PyTorchJobTuningScheduler

scheduler = PyTorchJobTuningScheduler.from_yaml("config.yaml")
scheduler.run()

Example configuration file (config.yaml):

scheduler:
  max_concurrent_tuning_per_namespace: 1  # Per-namespace concurrent tuning limit

tuning_rules:
  - job_filter:
      annotations:
        zenith-tune/optimization-config: null
    tuning_config:
      default_custom_patches:
        spec:
          pytorchReplicaSpecs:
            Worker:
              template:
                spec:
                  priorityClassName: low-priority

Notes and Best Practices

Jobs that existed before the scheduler started are automatically excluded
Jobs generated by tuning (with the zenith-tune/created-by: PyTorchJobTuner annotation) will not be tuning targets again
Set max_concurrent_tuning and max_concurrent_tuning_per_namespace appropriately according to cluster resource capacity
If the YAML format of annotations is incorrect, that job will be skipped

How to Use PyTorchJobTuningScheduler​

Annotation-Based Tuning​

1. Starting the Scheduler​

2. Preparing Jobs for Tuning​

3. Starting Automatic Tuning​

4. How to Stop Tuning​

Detailed Annotation Configuration Specification​

Overall Structure​

variables (Tuning Parameters)​

Type-Specific Fields​

objective (Optimization Target)​

Regular Expression Pattern Rules​

selector Option​

custom_patches (Custom Patches)​

Advanced Configuration​

Job Routing with TuningRule​

JobFilter​

TuningConfig​

Adjusting Concurrent Execution Count​

Tuning Metadata Environment Variables​

YAML Configuration File​

Notes and Best Practices​

Related Documentation​