
intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler

Scheduler for automatic PyTorchJob discovery and tuning in Kubernetes.

TuningConfig Objects

```python
@dataclass
class TuningConfig()
```

Configuration for a tuning job.
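
For a sense of how this dataclass is used, here is a minimal sketch; the field names (`n_trials`, `output_dir`, `maximize`, `wait_resources`) are taken from the YAML example further down this page, and any other fields or defaults are not confirmed here:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    TuningConfig,
)

# Hypothetical sketch: 20 optimization trials, results written to
# "production_outputs", maximizing the objective metric.
config = TuningConfig(
    n_trials=20,
    output_dir="production_outputs",
    maximize=True,
    wait_resources=True,  # wait for cluster resources before launching a trial
)
```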

JobFilter Objects

```python
@dataclass
class JobFilter()
```

Filter criteria for selecting PyTorchJobs to tune.
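
A sketch of a filter, assuming the `namespace_pattern` and `labels` fields shown in the YAML example below; other filter criteria are not documented here:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    JobFilter,
)

# Match PyTorchJobs in any namespace starting with "production-"
# that carry the label team=ml-team.
job_filter = JobFilter(
    namespace_pattern="production-.*",  # regex matched against the namespace
    labels={"team": "ml-team"},         # required Kubernetes labels
)
```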

TuningRule Objects

```python
@dataclass
class TuningRule()
```

A rule that maps a JobFilter to a TuningConfig.

When a job matches the filter, the associated tuning config is used.
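
Rules are evaluated in order and the first match wins, so more specific filters should come first. A minimal sketch, reusing the field names from the YAML example below:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    JobFilter,
    TuningConfig,
    TuningRule,
)

rules = [
    # Production jobs get a larger tuning budget.
    TuningRule(
        job_filter=JobFilter(namespace_pattern="production-.*"),
        tuning_config=TuningConfig(n_trials=20, output_dir="production_outputs"),
    ),
    # Catch-all: every other job gets the default budget.
    TuningRule(
        job_filter=JobFilter(namespace_pattern=".*"),
        tuning_config=TuningConfig(n_trials=10, output_dir="outputs"),
    ),
]
```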

PyTorchJobTuningScheduler Objects

```python
class PyTorchJobTuningScheduler()
```

Scheduler that discovers PyTorchJobs and automatically creates tuning jobs.

This scheduler periodically scans for PyTorchJobs matching the specified criteria and creates PyTorchJobTuner instances to optimize them.

__init__

```python
def __init__(tuning_rules: List[TuningRule],
             max_concurrent_tuning: Optional[int] = None,
             max_concurrent_tuning_per_namespace: Optional[int] = None,
             polling_interval: int = 60,
             timeout_per_trial: int = 1209600)
```

Initialize the tuning scheduler.

Arguments:

  • tuning_rules - List of TuningRule objects for rule-based config selection; rules are evaluated in order and the first match wins. Must not be empty. A job that matches a rule's filter is tuned using that rule's config.
  • max_concurrent_tuning - Maximum number of concurrent tuning jobs across all namespaces (None = unlimited)
  • max_concurrent_tuning_per_namespace - Maximum number of concurrent tuning jobs per namespace (None = unlimited)
  • polling_interval - Interval in seconds between polling scans (default: 60)
  • timeout_per_trial - Timeout in seconds for each trial (default: 1209600 = 2 weeks)

Raises:

  • ValueError - If tuning_rules is empty
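
Putting it together, the scheduler can be constructed directly from a rule list; this sketch mirrors the values used in the YAML example below and reuses the hypothetical `rules` list from the TuningRule section above:

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    PyTorchJobTuningScheduler,
)

scheduler = PyTorchJobTuningScheduler(
    tuning_rules=rules,                     # first matching rule wins; must not be empty
    max_concurrent_tuning=5,                # at most 5 tuning jobs cluster-wide
    max_concurrent_tuning_per_namespace=2,  # and at most 2 per namespace
    polling_interval=60,                    # scan for PyTorchJobs every 60 seconds
    timeout_per_trial=1209600,              # 2 weeks per trial
)
```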

from_yaml

```python
@classmethod
def from_yaml(cls, config_path: str) -> "PyTorchJobTuningScheduler"
```

Create a PyTorchJobTuningScheduler from a YAML configuration file.

Arguments:

  • config_path - Path to the YAML configuration file

Returns:

Configured PyTorchJobTuningScheduler instance

Example YAML structure:

```yaml
scheduler:
  max_concurrent_tuning: 5
  max_concurrent_tuning_per_namespace: 2
  polling_interval: 60
  timeout_per_trial: 1209600

tuning_rules:
  - job_filter:
      namespace_pattern: "production-.*"
      labels:
        team: "ml-team"
    tuning_config:
      n_trials: 20
      output_dir: "production_outputs"
      maximize: true
      wait_resources: true
  - job_filter:
      namespace_pattern: ".*"
    tuning_config:
      n_trials: 10
      output_dir: "outputs"
```
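
Given a file with that structure, loading it is a one-liner (the filename is illustrative):

```python
from intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler import (
    PyTorchJobTuningScheduler,
)

scheduler = PyTorchJobTuningScheduler.from_yaml("tuning_scheduler.yaml")
```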

run

```python
def run()
```

Run the scheduler continuously.

shutdown

```python
def shutdown()
```

Gracefully shut down the scheduler.

This will:

  1. Signal all threads to stop
  2. Wait for active tuning jobs to complete
  3. Shut down the executor
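
A typical pattern is to run the scheduler in the foreground and shut it down on interrupt; whether run() installs its own signal handlers is not stated here, so this sketch handles the interrupt explicitly:

```python
try:
    scheduler.run()  # blocks, polling for matching PyTorchJobs
except KeyboardInterrupt:
    # Stop threads, wait for active tuning jobs, then shut down the executor.
    scheduler.shutdown()
```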