
intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler

Scheduler for automatic PyTorchJob discovery and tuning in Kubernetes.

TuningConfig Objects

```python
@dataclass
class TuningConfig()
```

Configuration for a tuning job.
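
For a sense of how this dataclass is used, here is a minimal sketch; the field names (`n_trials`, `output_dir`, `maximize`, `wait_resources`) are taken from the YAML example further down this page, and any other fields or defaults are not confirmed here:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    TuningConfig,
)

# Hypothetical sketch: 20 optimization trials, results written to
# "production_outputs", maximizing the objective metric.
config = TuningConfig(
    n_trials=20,
    output_dir="production_outputs",
    maximize=True,
    wait_resources=True,  # wait for cluster resources before launching a trial
)
```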

JobFilter Objects

```python
@dataclass
class JobFilter()
```

Filter criteria for selecting PyTorchJobs to tune.
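
A sketch of a filter, assuming the `namespace_pattern` and `labels` fields shown in the YAML example below; other filter criteria are not documented here:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    JobFilter,
)

# Match PyTorchJobs in any namespace starting with "production-"
# that carry the label team=ml-team.
job_filter = JobFilter(
    namespace_pattern="production-.*",  # regex matched against the namespace
    labels={"team": "ml-team"},         # required Kubernetes labels
)
```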

TuningRule Objects

```python
@dataclass
class TuningRule()
```

A rule that maps a JobFilter to a TuningConfig.

When a job matches the filter, the associated tuning config is used.
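
Rules are evaluated in order and the first match wins, so more specific filters should come first. A minimal sketch, reusing the field names from the YAML example below:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    JobFilter,
    TuningConfig,
    TuningRule,
)

rules = [
    # Production jobs get a larger tuning budget.
    TuningRule(
        job_filter=JobFilter(namespace_pattern="production-.*"),
        tuning_config=TuningConfig(n_trials=20, output_dir="production_outputs"),
    ),
    # Catch-all: every other job gets the default budget.
    TuningRule(
        job_filter=JobFilter(namespace_pattern=".*"),
        tuning_config=TuningConfig(n_trials=10, output_dir="outputs"),
    ),
]
```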

PyTorchJobTuningScheduler Objects

```python
class PyTorchJobTuningScheduler()
```

Scheduler that discovers PyTorchJobs and automatically creates tuning jobs.

This scheduler periodically scans for PyTorchJobs matching the specified criteria and creates PyTorchJobTuner instances to optimize them.

__init__

```python
def __init__(tuning_rules: List[TuningRule],
             max_concurrent_tuning: Optional[int] = None,
             max_concurrent_tuning_per_namespace: Optional[int] = None,
             polling_interval: int = 60,
             timeout_per_trial: int = 1209600)
```

Initialize the tuning scheduler.

Arguments:

  • tuning_rules - List of TuningRule objects for rule-based config selection; rules are evaluated in order and the first match wins. Must not be empty. A job that matches a rule's filter is tuned using that rule's config.
  • max_concurrent_tuning - Maximum number of concurrent tuning jobs across all namespaces (None = unlimited)
  • max_concurrent_tuning_per_namespace - Maximum number of concurrent tuning jobs per namespace (None = unlimited)
  • polling_interval - Interval in seconds between polling scans (default: 60)
  • timeout_per_trial - Timeout in seconds for each trial (default: 1209600 = 2 weeks)

Raises:

  • ValueError - If tuning_rules is empty
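
Putting it together, the scheduler can be constructed directly from a rule list; this sketch mirrors the values used in the YAML example below and reuses the hypothetical `rules` list from the TuningRule section above:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    PyTorchJobTuningScheduler,
)

scheduler = PyTorchJobTuningScheduler(
    tuning_rules=rules,                     # first matching rule wins; must not be empty
    max_concurrent_tuning=5,                # at most 5 tuning jobs cluster-wide
    max_concurrent_tuning_per_namespace=2,  # and at most 2 per namespace
    polling_interval=60,                    # scan for PyTorchJobs every 60 seconds
    timeout_per_trial=1209600,              # 2 weeks per trial
)
```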

from_yaml

```python
@classmethod
def from_yaml(cls, config_path: str) -> "PyTorchJobTuningScheduler"
```

Create a PyTorchJobTuningScheduler from a YAML configuration file.

Arguments:

  • config_path - Path to the YAML configuration file

Returns:

Configured PyTorchJobTuningScheduler instance

Example YAML structure:

```yaml
scheduler:
  max_concurrent_tuning: 5
  max_concurrent_tuning_per_namespace: 2
  polling_interval: 60
  timeout_per_trial: 1209600

tuning_rules:
  - job_filter:
      namespace_pattern: "production-.*"
      labels:
        team: "ml-team"
    tuning_config:
      n_trials: 20
      output_dir: "production_outputs"
      maximize: true
      wait_resources: true
  - job_filter:
      namespace_pattern: ".*"
    tuning_config:
      n_trials: 10
      output_dir: "outputs"
```
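
Given a file with that structure, loading it is a one-liner (the filename is illustrative):

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    PyTorchJobTuningScheduler,
)

scheduler = PyTorchJobTuningScheduler.from_yaml("tuning_scheduler.yaml")
```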

run

```python
def run()
```

Run the scheduler continuously.

shutdown

```python
def shutdown()
```

Gracefully shut down the scheduler.

This will:

  1. Signal all threads to stop
  2. Wait for active tuning jobs to complete
  3. Shut down the executor
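
A typical pattern is to run the scheduler in the foreground and shut it down on interrupt; whether run() installs its own signal handlers is not stated here, so this sketch handles the interrupt explicitly:

```python
try:
    scheduler.run()  # blocks, polling for matching PyTorchJobs
except KeyboardInterrupt:
    # Stop threads, wait for active tuning jobs, then shut down the executor.
    scheduler.shutdown()
```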