
Auto Pruning

ZenithTune provides auto pruners that monitor trials during tuning and automatically prune those that take too long to execute or consume excessive GPU memory.

Auto pruners monitor specific conditions while the command executes and automatically stop a trial when its configured threshold is crossed (exceeded or fallen below, depending on the pruner). This helps avoid out-of-memory errors, prevents excessively long executions, uses resources efficiently, and shortens tuning time by terminating unpromising trials early.
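
Since auto_pruners accepts a list, several pruners can guard the same tuner at once. A minimal sketch combining the timeout and GPU-memory pruners introduced below (constructor arguments follow the examples on this page; the concrete values are illustrative):

from zenith_tune.auto_pruners import AIBoosterGPUMemoryUsedPruner, TimeoutPruner
from zenith_tune.tuners import CommandOutputTuner

# Prune a trial that runs longer than 30 minutes or uses more than 12800 MB of GPU memory
tuner = CommandOutputTuner(
    auto_pruners=[
        TimeoutPruner(timeout_seconds=1800.0),
        AIBoosterGPUMemoryUsedPruner(
            aibooster_server_address="http://localhost:16697",
            threshold=12800.0,  # MB
        ),
    ]
)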

Pruner Types

TimeoutPruner

Prunes trials that exceed the specified time limit.

from zenith_tune.auto_pruners import TimeoutPruner
from zenith_tune.tuners import CommandOutputTuner

# Pruner that times out after 300 seconds
timeout_pruner = TimeoutPruner(timeout_seconds=300.0)

tuner = CommandOutputTuner(
    auto_pruners=[timeout_pruner]
)

AIBooster Integration Pruners

Pruners that integrate with an AIBooster server to monitor GPU metrics.

AIBoosterGPUUtilizationPruner

Prunes trials when GPU utilization falls below the threshold.

from zenith_tune.auto_pruners import AIBoosterGPUUtilizationPruner
from zenith_tune.tuners import CommandOutputTuner

# Prune when GPU utilization falls below 5%
gpu_util_pruner = AIBoosterGPUUtilizationPruner(
    aibooster_server_address="http://localhost:16697",
    threshold=5.0,  # 5% GPU utilization
)

tuner = CommandOutputTuner(
    auto_pruners=[gpu_util_pruner]
)

AIBoosterGPUMemoryUsedPruner

Prunes trials when GPU memory usage exceeds the threshold.

import socket

from zenith_tune.auto_pruners import AIBoosterGPUMemoryUsedPruner
from zenith_tune.tuners import CommandOutputTuner

current_hostname = socket.gethostname()

# Prune when GPU memory usage exceeds 12800 MB (12.8 GB)
gpu_memory_pruner = AIBoosterGPUMemoryUsedPruner(
    aibooster_server_address="http://localhost:16697",
    threshold=12800.0,  # 12800 MB
    agent_gpu_filter={
        current_hostname: [0],  # Monitor GPU 0 on this host
    },
)

tuner = CommandOutputTuner(
    auto_pruners=[gpu_memory_pruner]
)

AIBoosterDCGMMetricsPruner (Custom Pruner)

In addition to the dedicated pruners above, you can use AIBoosterDCGMMetricsPruner directly to monitor a specific DCGM metric under a custom condition.

from zenith_tune.auto_pruners import AIBoosterDCGMMetricsPruner
from zenith_tune.tuners import CommandOutputTuner

# Custom condition: prune when GPU power usage exceeds 200 W
power_pruner = AIBoosterDCGMMetricsPruner(
    aibooster_server_address="http://localhost:16697",
    metric_name="DCGM_FI_DEV_POWER_USAGE",  # power usage metric
    threshold=200.0,  # 200 W
    prune_when="above",  # prune when the value exceeds the threshold
    reduction="max",  # judge by the maximum value in the window
    check_interval=10.0,
    warmup_duration=60.0,
)

tuner = CommandOutputTuner(
    auto_pruners=[power_pruner]
)
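
Other DCGM fields can be monitored with the same constructor. A hedged sketch of a temperature guard (DCGM_FI_DEV_GPU_TEMP is DCGM's standard GPU-temperature field; the parameters mirror the power example above, and the values are illustrative):

from zenith_tune.auto_pruners import AIBoosterDCGMMetricsPruner

# Another custom condition: prune when GPU temperature averages above 85 °C
temp_pruner = AIBoosterDCGMMetricsPruner(
    aibooster_server_address="http://localhost:16697",
    metric_name="DCGM_FI_DEV_GPU_TEMP",  # GPU temperature in degrees Celsius
    threshold=85.0,
    prune_when="above",
    reduction="mean",  # judge by the window average rather than a single spike
    check_interval=10.0,
    warmup_duration=60.0,
)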

Pruning Logic Adjustment

AIBooster integration pruners skip monitoring during the warmup_duration (default 60 seconds) from the start of execution, then perform a check every check_interval (default 10 seconds). At each check, they compute a statistic (mean, min, or max, selected by reduction) over the metrics collected during the last check_interval period and compare it against the threshold to decide whether to prune.
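
The following standalone sketch illustrates this decision rule; it is not ZenithTune's implementation, only the logic just described applied to a list of (elapsed_seconds, value) samples:

def should_prune(samples, now, threshold, prune_when="below",
                 reduction="mean", check_interval=10.0, warmup_duration=60.0):
    # No checks are performed during the warmup period.
    if now < warmup_duration:
        return False
    # Keep only the metric values observed in the last check_interval window.
    window = [v for (t, v) in samples if now - check_interval <= t <= now]
    if not window:
        return False
    # Reduce the window to a single statistic.
    value = {"mean": sum(window) / len(window),
             "min": min(window),
             "max": max(window)}[reduction]
    # Compare against the threshold in the configured direction.
    return value > threshold if prune_when == "above" else value < threshold

# Mean GPU utilization over the last 10 s is 3% against a 5% threshold -> prune
print(should_prune([(61, 2.0), (65, 3.0), (69, 4.0)], now=70.0, threshold=5.0))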

For example, with the default GPU utilization pruner settings, starting 60 seconds after execution begins, the pruner checks the average GPU utilization over the last 10 seconds every 10 seconds and prunes if it falls below the threshold. In deep learning workloads, however, CPU-side processing commonly lasts longer than 10 seconds during data loading, evaluation phases, checkpoint saving, and so on; GPU utilization drops during these phases, so a healthy trial can be pruned unintentionally.

Please adjust these parameters to match your workload's characteristics: if data preparation takes a long time, increase warmup_duration; if evaluation runs frequently, increase check_interval to prevent unintended pruning.
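
For instance, a configuration for a job whose data preparation takes about five minutes and whose evaluation can idle the GPU for up to a minute might look like the following sketch (this assumes AIBoosterGPUUtilizationPruner accepts check_interval and warmup_duration, as the description above implies for the AIBooster integration pruners; the values are illustrative):

from zenith_tune.auto_pruners import AIBoosterGPUUtilizationPruner

gpu_util_pruner = AIBoosterGPUUtilizationPruner(
    aibooster_server_address="http://localhost:16697",
    threshold=5.0,          # 5% GPU utilization
    check_interval=120.0,   # a longer window tolerates evaluation pauses
    warmup_duration=300.0,  # cover the data-preparation phase
)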

Usage Examples

Please refer to the following example files:

  • examples/pruners/timeout_pruner_example.py
  • examples/pruners/gpu_memory_pruner_example.py