Version: v2509

auto_pruners.aib_metrics

AIBoosterDCGMMetricsPruner Objects

class AIBoosterDCGMMetricsPruner(AutoPrunerBase)

Prune based on AIBooster DCGM metrics conditions.

__init__

def __init__(aibooster_server_address: str,
             metric_name: str,
             threshold: float,
             prune_when: str = "below",
             reduction: str = "mean",
             agent_gpu_filter: dict[str, list[int]] | None = None,
             check_interval: float = 10.0,
             warmup_duration: float = 60.0)

Initialize AIBooster DCGM metrics pruner.

Pruning logic:

  1. Waits warmup_duration seconds after the trial starts; no metrics are checked during this period
  2. After the warmup, checks the metrics every check_interval seconds
  3. At each check, reduces the metrics collected during the last check_interval seconds to a single value using the chosen reduction (mean, min, max, or median)
  4. Prunes if the reduced value meets the threshold condition (below or above threshold, as set by prune_when)

Example: With the default settings (warmup_duration=60.0, check_interval=10.0, reduction="mean"), monitoring starts 60 seconds after the trial begins. Every 10 seconds thereafter, the pruner computes the mean of the metrics from the preceding 10 seconds and compares it with threshold.
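The reduce-then-compare step described above can be sketched in a few lines. This is an illustration only, not the library's implementation: the helper names reduce_window and meets_condition are hypothetical, and the use of strict comparisons (< and >) is an assumption, since this page does not say how a value equal to threshold is treated.

```python
import statistics

# Hypothetical helpers for illustration; the library's internals are not shown here.
REDUCERS = {
    "mean": statistics.mean,
    "min": min,
    "max": max,
    "median": statistics.median,
}


def reduce_window(samples: list[float], reduction: str = "mean") -> float:
    """Collapse the samples from one check_interval window into a single value."""
    return REDUCERS[reduction](samples)


def meets_condition(value: float, threshold: float, prune_when: str = "below") -> bool:
    """Return True when the reduced value is on the pruning side of the threshold.

    Strict comparison is an assumption made for this sketch.
    """
    if prune_when == "below":
        return value < threshold
    if prune_when == "above":
        return value > threshold
    raise ValueError(f"prune_when must be 'below' or 'above', got {prune_when!r}")


# Example: GPU utilization samples (%) gathered during the last 10-second window.
window = [3.0, 4.0, 2.0, 7.0]
reduced = reduce_window(window, "mean")  # 4.0
decision = meets_condition(reduced, threshold=5.0, prune_when="below")  # True
```

Choosing reduction="max" instead would make the same window yield 7.0, so the trial would not be pruned; the reduction decides whether a single busy sample is enough to keep a trial alive.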

Arguments:

  • aibooster_server_address - AIBooster server address (e.g., "http://localhost:16697")
  • metric_name - DCGM metric to monitor
  • threshold - Metric threshold for pruning
  • prune_when - Direction of the comparison: prune when the reduced value is "below" or "above" the threshold
  • reduction - Statistical reduction method ("mean", "max", "min", "median")
  • agent_gpu_filter - Dict of agent_name -> [gpu_indices] to filter specific GPUs (None = all)
  • check_interval - Interval between metric checks in seconds
  • warmup_duration - Warmup period before starting checks in seconds (60+ seconds recommended)
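To make the agent_gpu_filter shape concrete, the sketch below filters a list of samples by agent name and GPU index. The sample layout (agent_name, gpu_index, value) and the function apply_gpu_filter are hypothetical, and the sketch assumes that agents missing from the dict are excluded when a filter is given; this page does not confirm either detail.

```python
from typing import Optional

# Hypothetical sample layout for illustration: (agent_name, gpu_index, value).
Sample = tuple[str, int, float]


def apply_gpu_filter(samples: list[Sample],
                     agent_gpu_filter: Optional[dict[str, list[int]]]) -> list[Sample]:
    """Keep only samples from the listed GPUs; None keeps every sample."""
    if agent_gpu_filter is None:
        return list(samples)
    # Assumption: an agent that is not a key in the dict contributes no samples.
    return [
        sample for sample in samples
        if sample[1] in agent_gpu_filter.get(sample[0], [])
    ]


samples = [
    ("node-a", 0, 91.0),
    ("node-a", 1, 12.0),
    ("node-b", 0, 55.0),
]

# Monitor GPUs 0 and 1 on node-a only.
selected = apply_gpu_filter(samples, {"node-a": [0, 1]})
```

Restricting the filter to the GPUs a trial actually uses keeps idle devices on the same host from pulling a mean or min reduction toward the threshold.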

on_start

def on_start() -> None

Called when command execution starts.

on_end

def on_end() -> None

Called when command execution ends.

should_prune

def should_prune() -> bool

Check whether the command should be terminated based on the metrics.
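The three hooks above suggest a simple lifecycle: on_start when the command begins, repeated should_prune calls while it runs, and on_end when it finishes. The following stand-in class shows one way such hooks could fit together with a warmup period. SketchPruner, its constructor arguments, and the injected clock are all hypothetical; it does not derive from AutoPrunerBase and does not contact an AIBooster server.

```python
from typing import Callable, Optional


class SketchPruner:
    """Illustrative stand-in with the same three hooks; not the library class."""

    def __init__(self, read_value: Callable[[], float], threshold: float,
                 warmup_duration: float, clock: Callable[[], float]) -> None:
        self._read_value = read_value
        self._threshold = threshold
        self._warmup_duration = warmup_duration
        self._clock = clock
        self._started_at: Optional[float] = None

    def on_start(self) -> None:
        # Record when the command began so the warmup can be measured.
        self._started_at = self._clock()

    def on_end(self) -> None:
        # Clear state so the instance can be reused for the next command.
        self._started_at = None

    def should_prune(self) -> bool:
        if self._started_at is None:
            return False  # no command is running
        if self._clock() - self._started_at < self._warmup_duration:
            return False  # still inside the warmup period
        return self._read_value() < self._threshold


# Drive the hooks with a fake clock instead of real time.
now = [0.0]
pruner = SketchPruner(read_value=lambda: 2.0, threshold=5.0,
                      warmup_duration=60.0, clock=lambda: now[0])
pruner.on_start()
now[0] = 30.0
during_warmup = pruner.should_prune()  # False: only 30 s have elapsed
now[0] = 70.0
after_warmup = pruner.should_prune()   # True: warmup is over and 2.0 < 5.0
pruner.on_end()
```

Injecting the clock is a choice made for this sketch so that the warmup behavior can be exercised without sleeping.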

AIBoosterGPUUtilizationPruner Objects

class AIBoosterGPUUtilizationPruner(AIBoosterDCGMMetricsPruner)

Prune when GPU utilization falls below the threshold.

__init__

def __init__(aibooster_server_address: str,
             threshold: float,
             agent_gpu_filter: dict[str, list[int]] | None = None,
             check_interval: float = 10.0,
             warmup_duration: float = 60.0)

Initialize GPU utilization pruner.

Arguments:

  • aibooster_server_address - AIBooster server address (e.g., "http://localhost:16697")
  • threshold - GPU utilization threshold below which to prune (e.g., 5.0 for 5%)
  • agent_gpu_filter - Dict of agent_name -> [gpu_indices] to filter specific GPUs (None = all)
  • check_interval - Interval between metric checks in seconds
  • warmup_duration - Warmup period before starting checks in seconds (60+ seconds recommended)

AIBoosterGPUMemoryUsedPruner Objects

class AIBoosterGPUMemoryUsedPruner(AIBoosterDCGMMetricsPruner)

Prune when GPU memory usage (MB) rises above the threshold.

__init__

def __init__(aibooster_server_address: str,
             threshold: float,
             agent_gpu_filter: dict[str, list[int]] | None = None,
             check_interval: float = 10.0,
             warmup_duration: float = 60.0)

Initialize GPU memory usage pruner.

Arguments:

  • aibooster_server_address - AIBooster server address (e.g., "http://localhost:16697")
  • threshold - GPU memory usage threshold in MB above which to prune
  • agent_gpu_filter - Dict of agent_name -> [gpu_indices] to filter specific GPUs (None = all)
  • check_interval - Interval between metric checks in seconds
  • warmup_duration - Warmup period before starting checks in seconds (60+ seconds recommended)

AIBoosterTemperaturePruner Objects

class AIBoosterTemperaturePruner(AIBoosterDCGMMetricsPruner)

Prune when GPU temperature rises above the threshold.

__init__

def __init__(aibooster_server_address: str,
             threshold: float,
             agent_gpu_filter: dict[str, list[int]] | None = None,
             check_interval: float = 10.0,
             warmup_duration: float = 60.0)

Initialize GPU temperature pruner.

Arguments:

  • aibooster_server_address - AIBooster server address (e.g., "http://localhost:16697")
  • threshold - GPU temperature threshold above which to prune (DCGM reports GPU temperature in degrees Celsius)
  • agent_gpu_filter - Dict of agent_name -> [gpu_indices] to filter specific GPUs (None = all)
  • check_interval - Interval between metric checks in seconds
  • warmup_duration - Warmup period before starting checks in seconds (60+ seconds recommended)
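The three specialized pruners differ from the base class only in that they drop metric_name, prune_when, and reduction from the constructor, which suggests each one fixes those values internally. The table below sketches that relationship. The prune_when directions come from the class descriptions on this page; the metric names are standard DCGM field identifiers chosen here as an assumption, and the Preset type and would_prune function are hypothetical. The reduction each class uses is not stated on this page, so the sketch takes an already reduced value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    metric_name: str  # assumed DCGM field identifier, not confirmed by this page
    prune_when: str   # direction stated in each class description


PRESETS = {
    "AIBoosterGPUUtilizationPruner": Preset("DCGM_FI_DEV_GPU_UTIL", "below"),
    "AIBoosterGPUMemoryUsedPruner": Preset("DCGM_FI_DEV_FB_USED", "above"),
    "AIBoosterTemperaturePruner": Preset("DCGM_FI_DEV_GPU_TEMP", "above"),
}


def would_prune(class_name: str, reduced_value: float, threshold: float) -> bool:
    """Apply the preset direction for the given pruner class to a reduced value."""
    preset = PRESETS[class_name]
    if preset.prune_when == "below":
        return reduced_value < threshold
    return reduced_value > threshold


# A trial averaging 3% GPU utilization against a 5% threshold would be pruned.
idle_trial = would_prune("AIBoosterGPUUtilizationPruner", 3.0, 5.0)
# A trial using 30000 MB against a 40000 MB threshold would keep running.
within_memory = would_prune("AIBoosterGPUMemoryUsedPruner", 30000.0, 40000.0)
```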