
Job Tuning Example with PyTorchJobTuner

This document walks through an example of tuning a PyTorchJob on a Kubernetes cluster with PyTorchJobTuner. It demonstrates how to reduce training time by tuning the number of data loader workers and OpenMP threads for the MNIST training script included in intelligence/components/zenith-tune/examples/integration/kubernetes.

Prerequisites

Before executing the job tuning example, please prepare the following:

  1. A Kubernetes cluster environment where PyTorchJobs can be submitted and retrieved
  2. kubectl installed
  3. A Python environment with ZenithTune installed

If you don't have access to a Kubernetes cluster, you can execute the job tuning example by setting up a local cluster using minikube.

# Start Minikube cluster and set context
minikube start
kubectl config use-context minikube

# Install Kubeflow Training Operator
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

Files Used in Job Tuning

This section explains the files included in FAIB's intelligence/components/zenith-tune/examples/integration/kubernetes that are used for job tuning.

job_training.yaml

This YAML file defines a PyTorchJob that runs a simple MNIST training script. Note that the training script is defined in a ConfigMap.

Main features of the training script (a sketch follows this list):

  • Accepts a --num-workers argument to configure DataLoader workers
  • Reads the OMP_NUM_THREADS environment variable to configure PyTorch threads
  • Outputs training time
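
For orientation, the script's handling of these two knobs might look roughly like the minimal sketch below. This is not the train_mnist.py shipped in the ConfigMap: the --num-workers argument and the OMP_NUM_THREADS handling match the description above, while the dataset and loop are purely illustrative.

    import argparse
    import os
    import time

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    parser = argparse.ArgumentParser()
    parser.add_argument("--num-workers", type=int, default=0)  # tunable DataLoader workers
    args = parser.parse_args()

    # PyTorch/OpenMP read OMP_NUM_THREADS at startup; mirrored explicitly here for clarity.
    torch.set_num_threads(int(os.environ.get("OMP_NUM_THREADS", "1")))

    # Dummy dataset stands in for MNIST in this sketch.
    dataset = TensorDataset(torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=args.num_workers)

    start = time.time()
    for images, labels in loader:  # training loop body omitted
        pass
    print(f"Elapsed time: {time.time() - start:.4f} seconds")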

tune_job_training.py

This Python script uses PyTorchJobTuner to tune the num-workers and OMP_NUM_THREADS for the MNIST training defined in job_training.yaml.

First, PyTorchJobTuner identifies the target job from its name (job_name) and the namespace it was retrieved from (get_namespace). submit_namespace is the namespace that tuning jobs are submitted to and can be set independently of get_namespace. db_path is the path of the database used to resume or analyze existing job tuning results.

    tuner = PyTorchJobTuner(
        job_name=args.job_name,
        get_namespace=args.namespace,
        submit_namespace=args.namespace,
        db_path=args.db_path,
    )
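
For context, the args object above comes from the script's command-line parsing. The sketch below is illustrative: the --job-name and --n-trials flags appear in the run command later in this document, while the --namespace and --db-path defaults are assumptions.

    import argparse

    parser = argparse.ArgumentParser(description="Tune a PyTorchJob with PyTorchJobTuner")
    parser.add_argument("--job-name", required=True, help="Name of the existing PyTorchJob to tune")
    parser.add_argument("--namespace", default="default", help="Namespace of the job (assumed default)")
    parser.add_argument("--db-path", default=None, help="Database path for resuming/analyzing results")
    parser.add_argument("--n-trials", type=int, default=5, help="Number of tuning trials")
    args = parser.parse_args()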

Next, the user defines a function job_converter to convert an existing PyTorchJob into a tuning job and a function value_extractor to extract the objective value, then executes job tuning by passing them to the optimize function. Implementation examples of job_converter and value_extractor are explained below.

    tuner.optimize(
        job_converter=job_converter,
        value_extractor=value_extractor,
        n_trials=args.n_trials,
    )

Implementation Example of job_converter

In this script's job_converter, tuning parameters omp_num_threads and num_workers are defined, with omp_num_threads set as an environment variable and num_workers set as a command argument. The set_env function of PyTorchJob is used to set environment variables.

def job_converter(trial: Trial, job: PyTorchJob) -> PyTorchJob:
    """
    Update job definition with different OMP_NUM_THREADS and num_workers settings.

    Args:
        trial: Optuna trial object for suggesting parameters
        job: PyTorchJob object to update

    Returns:
        Updated PyTorchJob object
    """
    # Suggest number of threads and workers
    num_threads = trial.suggest_int("omp_num_threads", 1, 8)
    num_workers = trial.suggest_int("num_workers", 0, 4)

    # Set environment variable using convenient API
    job.set_env("OMP_NUM_THREADS", str(num_threads))

    ...

To add num_workers as a command-line argument, we use the CommandBuilder class to modify the command line. We first retrieve the original command with PyTorchJob.get_command, but since PyTorchJob commands are expressed as an array, we need to extract only the target command string. In this example the command string is stored as the third element (index 2) of an array like ['sh', '-c', 'command'], so we pass only that string to CommandBuilder.

Then we use CommandBuilder's append function to add the --num-workers option. append is appropriate here because we know the existing command does not yet include --num-workers; to change the value of an option that is already present, use the update function instead (a sketch follows the code below).

    ...

    # Update command to include num_workers argument using CommandBuilder
    current_command = job.get_command()
    assert (
        current_command
        and len(current_command) >= 3
        and current_command[0] == "sh"
        and current_command[1] == "-c"
    ), f"Expected ['sh', '-c', 'command'] format, got: {current_command}"

    # Modify only the actual command part (index 2)
    actual_command = current_command[2]
    builder = CommandBuilder(actual_command)
    builder.append(f"--num-workers {num_workers}")
    ...
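
If the command had already contained a --num-workers option, the update function mentioned above would be the right tool. The snippet below is only a sketch: update is named in this document, but its exact argument form is an assumption here and should be checked against the ZenithTune API reference.

    # Hedged sketch: assumes update takes the option in the same string form as append.
    builder = CommandBuilder("python /scripts/train_mnist.py --num-workers 2")
    builder.update(f"--num-workers {num_workers}")  # assumed signature; replaces the existing value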

After that, we set the new command with PyTorchJob.set_command and return the updated job to complete the implementation of job_converter.

    ...

    # Replace the command part while keeping sh -c wrapper
    new_command = current_command.copy()
    new_command[2] = builder.get_command()
    job.set_command(new_command)

    print(
        f"Trial {trial.number}: OMP_NUM_THREADS={num_threads}, num_workers={num_workers}"
    )

    return job

Implementation Example of value_extractor

As with CommandOutputTuner, value_extractor receives the log file path as its first argument, and we implement the logic that extracts the objective value from the log. In this case the training time is printed after Elapsed time:, so we extract and return that value. If value extraction fails because training failed or the job was interrupted, return None to reject that trial.

def value_extractor(log_path: str) -> Optional[float]:
    """Extract objective value from log file."""
    with open(log_path, "r") as f:
        logs = f.read()

    # Look for the line with elapsed time
    match = re.search(r"Elapsed time: ([0-9.]+) seconds", logs)
    if match:
        elapsed_time = float(match.group(1))
        return elapsed_time
    else:
        print(f"Could not find elapsed time in {log_path}")
        return None
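
As a quick sanity check, you can run value_extractor against a hand-written log before launching real trials; the file path below is purely illustrative.

    # Write a fake worker log and confirm the extractor parses it (illustrative path).
    with open("/tmp/sample_worker.log", "w") as f:
        f.write("Epoch 5 done\nElapsed time: 78.9675 seconds\n")

    print(value_extractor("/tmp/sample_worker.log"))  # -> 78.9675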

Executing Job Tuning

Deploy the PyTorchJob defined in job_training.yaml to the Kubernetes cluster; it serves as the existing job that job tuning starts from.

kubectl create -f job_training.yaml
  • If you encounter an error like Error from server (AlreadyExists): error when creating "job_training.yaml": configmaps "training-script" already exists, you can ignore it; the ConfigMap has already been created.
  • The initial create may take several tens of minutes to pull the Docker image.

Check the name of the deployed PyTorchJob and the training time.

$ kubectl get pytorchjobs
NAME                 STATUS      AGE
job-training-76kxt   Succeeded   1m
$ kubectl logs job-training-76kxt-worker-0
...
Elapsed time: 78.9675 seconds

Then, execute the tuning script targeting the job confirmed above. We set the number of trials to 5.

python tune_job_training.py --job-name job-training-76kxt --n-trials 5

When tuning is complete, logs like the following will be output (some zenith-tune log output is omitted). The tuning results show that the combination of {'omp_num_threads': 1, 'num_workers': 0} is the fastest.

[I 2025-08-25 14:27:10,064] Trial 0 finished with value: 15.9339 and parameters: {'omp_num_threads': 1, 'num_workers': 0}. Best is trial 0 with value: 15.9339.
[I 2025-08-25 14:28:14,611] Trial 1 finished with value: 27.2895 and parameters: {'omp_num_threads': 2, 'num_workers': 4}. Best is trial 0 with value: 15.9339.
[I 2025-08-25 14:29:37,069] Trial 2 finished with value: 58.804 and parameters: {'omp_num_threads': 4, 'num_workers': 3}. Best is trial 0 with value: 15.9339.
[I 2025-08-25 14:30:38,995] Trial 3 finished with value: 47.4847 and parameters: {'omp_num_threads': 3, 'num_workers': 4}. Best is trial 0 with value: 15.9339.
[I 2025-08-25 14:32:03,641] Trial 4 finished with value: 30.1066 and parameters: {'omp_num_threads': 3, 'num_workers': 0}. Best is trial 0 with value: 15.9339.
2025-08-25 14:32:03,653 - zenith-tune - INFO - Best trial: trial_id=1, value=15.9339, params={'omp_num_threads': 1, 'num_workers': 0}

Applying Tuning Results

By applying the parameters obtained through tuning to the YAML file, future training runs will be accelerated.

Apply the parameters to job_training.yaml.

    command:
      - sh
      - -c
      - |
-       python /scripts/train_mnist.py
+       OMP_NUM_THREADS=1 python /scripts/train_mnist.py --num-workers 0

Execute the PyTorchJob with the applied parameters and confirm that it is actually faster.

# Ignore configmaps "training-script" already exists
$ kubectl create -f job_training.yaml
pytorchjob.kubeflow.org/job-training-nbmtz created
Error from server (AlreadyExists): error when creating "job_training_new.yaml": configmaps "training-script" already exists

# After execution completes
$ kubectl logs job-training-nbmtz-worker-0
...
Elapsed time: 16.0635 seconds

The training completed in roughly the same time as the best value found during tuning (15.9339 seconds), so job tuning achieved an approximately 4.9x speedup over the original training time of 78.9675 seconds.

In this way, by using PyTorchJobTuner provided by ZenithTune, you can efficiently discover optimal hyperparameters without manual trial and error and reduce training time.