Version: v2602

intelligence.zenith_tune.integration.kubernetes.pytorchjob_tuner

PyTorchJobTuner Objects

class PyTorchJobTuner(GeneralTuner)

Kubernetes PyTorchJob tuner that extends GeneralTuner for Kubernetes environments.

__init__

def __init__(job_name: str,
             get_namespace: Optional[str] = None,
             submit_namespace: Optional[str] = None,
             output_dir: str = "outputs",
             study_name: Optional[str] = None,
             db_path: Optional[str] = None,
             sampler: Optional[BaseSampler] = None,
             maximize: bool = False,
             timeout_per_trial: int = 86400,
             wait_resources: bool = False,
             polling_interval: int = 60)

Initialize the Kubernetes PyTorchJob tuning orchestrator.

Arguments:

  • job_name - Name of the original PyTorchJob that trials are based on

  • get_namespace - Namespace to search for the original job (default: None, which uses the current namespace)

  • submit_namespace - Target namespace for job submission (default: current namespace from kubeconfig)

  • output_dir - Directory to store study results

  • study_name - Name for the Optuna study

  • db_path - Path to the database file for Optuna study persistence

  • sampler - Sampler to use for optimization

  • maximize - Whether to maximize the objective function

  • timeout_per_trial - Timeout in seconds for each trial (default: 86400 = 24 hours)

  • wait_resources - Whether to wait for resources before each trial

  • polling_interval - Interval in seconds for polling checks

Environment Variables:

  • ZENITHTUNE_K8S_IN_CLUSTER_DISABLE - Set to "1" to disable in-cluster config detection and force kubeconfig usage. Useful when running inside a Kubernetes cluster but needing to connect to a different cluster.
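As a minimal sketch of the environment variable described above, the snippet below sets it from Python before any tuner is created. The helper name force_kubeconfig is hypothetical, and the idea that the variable must be set before PyTorchJobTuner is constructed is an assumption; this section does not say when the variable is read.

```python
import os


def force_kubeconfig() -> None:
    """Disable in-cluster config detection so that kubeconfig is used.

    Assumption: the variable is read when the tuner is constructed, so
    this helper (a hypothetical name) should be called before that point.
    """
    os.environ["ZENITHTUNE_K8S_IN_CLUSTER_DISABLE"] = "1"


force_kubeconfig()
```

Exporting the variable in the shell or in the Pod spec that launches the tuning script would have the same effect on the process environment.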

get_logs

def get_logs() -> str

Get logs from the original PyTorchJob (public API).

Returns:

Job logs as a string

optimize

def optimize(job_converter: Callable[[Trial, PyTorchJob], PyTorchJob],
             value_extractor: Callable[[str, PyTorchJob], float],
             n_trials: int = 10,
             default_params: Optional[Dict[str, Any]] = None)

Execute Kubernetes PyTorchJob tuning using the GeneralTuner framework.

Arguments:

  • job_converter - Function that updates the job definition based on trial parameters. Takes (Trial, PyTorchJob) and returns a PyTorchJob.
  • value_extractor - Function that extracts the objective value from a log file. Takes (log_file_path, job) and returns a float.
  • n_trials - Number of trials to run
  • default_params - Default parameters for the first trial
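The two callbacks described above might look like the following sketch. Several things here are assumptions, not part of the documented API: StandInJob is a placeholder for PyTorchJob (whose interface is not shown in this section), the env attribute is a hypothetical way to modify the job, and the "throughput: <number>" log format and the omp_num_threads parameter name are purely illustrative. Only trial.suggest_int is an existing Optuna Trial method.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StandInJob:
    """Placeholder for PyTorchJob; the real interface is not shown here."""
    env: Dict[str, str] = field(default_factory=dict)


def job_converter(trial: Any, job: StandInJob) -> StandInJob:
    """Apply one trial's suggested parameters to the job and return it."""
    # suggest_int(name, low, high) is Optuna's Trial API.
    threads = trial.suggest_int("omp_num_threads", 1, 32)
    # Hypothetical: the real PyTorchJob may expose a different way to set this.
    job.env["OMP_NUM_THREADS"] = str(threads)
    return job


# Assumed log format: lines such as "throughput: 123.4 samples/sec".
_THROUGHPUT = re.compile(r"throughput:\s*([0-9]+(?:\.[0-9]+)?)")


def value_extractor(log_file_path: str, job: Any) -> float:
    """Return the last throughput value found in the trial's log file."""
    last = None
    with open(log_file_path, encoding="utf-8") as f:
        for line in f:
            match = _THROUGHPUT.search(line)
            if match:
                last = float(match.group(1))
    if last is None:
        raise ValueError(f"no throughput line found in {log_file_path}")
    return last
```

With a throughput-style objective such as this one, the tuner would presumably be constructed with maximize=True. How the tuner treats an exception raised by value_extractor is not stated in this section.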

Returns:

Tuple of (best_value, best_params) if successful, (None, None) otherwise
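Because the return value can be (None, None), callers may want to check it before using the result. The helper below is a small sketch of that check; summarize is a hypothetical name, and it only relies on the tuple shape documented above.

```python
from typing import Any, Dict, Optional, Tuple


def summarize(result: Tuple[Optional[float], Optional[Dict[str, Any]]]) -> str:
    """Turn the (best_value, best_params) tuple into a short message."""
    best_value, best_params = result
    if best_value is None or best_params is None:
        # Documented failure case: optimize returned (None, None).
        return "tuning did not produce a result"
    return f"best value {best_value} with params {best_params}"
```

In a tuning script this would be called as summarize(tuner.optimize(...)), where tuner is a PyTorchJobTuner instance.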