intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuning_scheduler
Scheduler for automatic PyTorchJob discovery and tuning in Kubernetes.
TuningConfig Objects
```python
@dataclass
class TuningConfig()
```
Configuration for a tuning job.
JobFilter Objects
```python
@dataclass
class JobFilter()
```
Filter criteria for selecting PyTorchJobs to tune.
TuningRule Objects
```python
@dataclass
class TuningRule()
```
A rule that maps a JobFilter to a TuningConfig.
When a job matches the filter, the associated tuning config is used.
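The first-match-wins selection can be sketched with stand-in dataclasses. This is a hedged illustration, not the library's actual classes: the field names (`namespace_pattern`, `labels`, `n_trials`, `output_dir`) are inferred from the `from_yaml` example further down, and the `matches` and `select_config` helpers are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TuningConfig:
    n_trials: int = 10
    output_dir: str = "outputs"


@dataclass
class JobFilter:
    namespace_pattern: str = ".*"
    labels: Dict[str, str] = field(default_factory=dict)

    def matches(self, namespace: str, job_labels: Dict[str, str]) -> bool:
        # The namespace must match the regex and every filter label
        # must be present on the job with the same value.
        if not re.fullmatch(self.namespace_pattern, namespace):
            return False
        return all(job_labels.get(k) == v for k, v in self.labels.items())


@dataclass
class TuningRule:
    job_filter: JobFilter
    tuning_config: TuningConfig


def select_config(rules: List[TuningRule], namespace: str,
                  job_labels: Dict[str, str]) -> Optional[TuningConfig]:
    # First match wins: rules are evaluated in list order.
    for rule in rules:
        if rule.job_filter.matches(namespace, job_labels):
            return rule.tuning_config
    return None
```

Because the first matching rule wins, the most specific filters should come first, with a catch-all `".*"` rule last.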
PyTorchJobTuningScheduler Objects
```python
class PyTorchJobTuningScheduler()
```
Scheduler that discovers PyTorchJobs and automatically creates tuning jobs.
This scheduler periodically scans for PyTorchJobs matching specified criteria and creates PyTorchJobTuner instances to optimize them.
__init__
```python
def __init__(tuning_rules: List[TuningRule],
             max_concurrent_tuning: Optional[int] = None,
             max_concurrent_tuning_per_namespace: Optional[int] = None,
             polling_interval: int = 60,
             timeout_per_trial: int = 1209600)
```
Initialize the tuning scheduler.
Arguments:
- `tuning_rules` - List of TuningRule for rule-based config selection (first match wins). Must not be empty. Jobs that match a rule's filter will be tuned using the rule's config.
- `max_concurrent_tuning` - Maximum number of concurrent tuning jobs (None = unlimited).
- `max_concurrent_tuning_per_namespace` - Maximum number of concurrent tuning jobs per namespace (None = unlimited).
- `polling_interval` - Interval in seconds between polling checks (default: 60).
- `timeout_per_trial` - Timeout in seconds for each trial (default: 1209600 = 2 weeks).
Raises:
- `ValueError` - If `tuning_rules` is empty.
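The two concurrency limits combine into a simple admission check before a new tuning job is launched. The sketch below is an assumption about the bookkeeping, not the scheduler's actual implementation: `may_start_tuning` and the per-namespace `Counter` of active jobs are hypothetical names.

```python
from collections import Counter
from typing import Optional


def may_start_tuning(active_per_namespace: Counter, namespace: str,
                     max_concurrent: Optional[int],
                     max_per_namespace: Optional[int]) -> bool:
    # None means "unlimited" for either limit, matching the
    # constructor defaults documented above.
    total_active = sum(active_per_namespace.values())
    if max_concurrent is not None and total_active >= max_concurrent:
        return False
    if (max_per_namespace is not None
            and active_per_namespace[namespace] >= max_per_namespace):
        return False
    return True
```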
from_yaml
```python
@classmethod
def from_yaml(cls, config_path: str) -> "PyTorchJobTuningScheduler"
```
Create a PyTorchJobTuningScheduler from a YAML configuration file.
Arguments:
- `config_path` - Path to the YAML configuration file.
Returns:
Configured PyTorchJobTuningScheduler instance
Example YAML structure:

```yaml
scheduler:
  max_concurrent_tuning: 5
  max_concurrent_tuning_per_namespace: 2
  polling_interval: 60
  timeout_per_trial: 1209600
tuning_rules:
  - job_filter:
      namespace_pattern: "production-.*"
      labels:
        team: "ml-team"
    tuning_config:
      n_trials: 20
      output_dir: "production_outputs"
      maximize: true
      wait_resources: true
  - job_filter:
      namespace_pattern: ".*"
    tuning_config:
      n_trials: 10
      output_dir: "outputs"
```
run
```python
def run()
```
Run the scheduler continuously.
shutdown
```python
def shutdown()
```
Gracefully shutdown the scheduler.
This will:
- Signal all threads to stop
- Wait for active tuning jobs to complete
- Shutdown the executor
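The `run()`/`shutdown()` interplay can be sketched with a polling loop driven by a `threading.Event`. `PollingLoop` below is a hypothetical stand-in, a minimal sketch under the assumption that the loop sleeps via `Event.wait` so that shutdown interrupts the wait immediately; the real scheduler additionally drains its executor and waits for active tuning jobs to complete.

```python
import threading


class PollingLoop:
    """Toy stand-in for the scheduler's run()/shutdown() pair."""

    def __init__(self, polling_interval: float = 60.0):
        self.polling_interval = polling_interval
        self._stop = threading.Event()
        self.scans = 0

    def run(self):
        # Poll until shutdown() is called. Event.wait doubles as an
        # interruptible sleep, so shutdown takes effect without waiting
        # out the full polling interval.
        while not self._stop.is_set():
            self.scans += 1  # stand-in for "scan for matching PyTorchJobs"
            self._stop.wait(self.polling_interval)

    def shutdown(self):
        # Signal the loop to stop; run() returns after the current scan.
        self._stop.set()
```

In practice `run()` would sit on the main thread with `shutdown()` wired to SIGTERM/SIGINT handlers, so the scheduler pod terminates cleanly when Kubernetes stops it.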