intelligence.zenith_tune.evaluators.megatron
Evaluator for Megatron training throughput.
MegatronThroughputEvaluator Objects
@EvaluatorRegistry.register("megatron")
class MegatronThroughputEvaluator(TuningEvaluator)
Extract throughput (TFLOP/s/GPU) from Megatron training output.
Searches stdout for the last occurrence of
throughput per GPU (TFLOP/s/GPU): <value>.
Raises ValueError if the pattern is not found in stdout.
Example:
evaluator = MegatronThroughputEvaluator()
value = evaluator.evaluate(stdout, metadata)
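The "last occurrence" behaviour described above can be sketched as follows. This is a minimal illustration, not the evaluator's actual implementation: the regular expression, the helper name extract_last_throughput, and the sample log lines are all assumptions made for the example.

```python
import re
from typing import Any, Optional

# Assumed pattern: the real evaluator's regex may differ (for example in
# how it treats whitespace or scientific notation).
_THROUGHPUT_RE = re.compile(
    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)"
)


def extract_last_throughput(
    stdout: str, metadata: Optional[dict[str, Any]] = None
) -> float:
    """Return the last reported throughput, mirroring the documented contract."""
    matches = _THROUGHPUT_RE.findall(stdout)
    if not matches:
        # Same failure mode as documented: no throughput line in stdout.
        raise ValueError("throughput line not found in stdout")
    # Earlier iterations are ignored; only the final report counts.
    return float(matches[-1])


# Hypothetical log excerpt, shaped only to contain the documented substring.
sample_stdout = (
    " iteration 10/100 | throughput per GPU (TFLOP/s/GPU): 412.5 |\n"
    " iteration 20/100 | throughput per GPU (TFLOP/s/GPU): 431.2 |\n"
)

print(extract_last_throughput(sample_stdout))  # prints 431.2
```

Taking the last match rather than the first means warm-up iterations, which often report lower throughput, do not determine the score.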
evaluate
def evaluate(stdout: str, metadata: dict[str, Any]) -> float
Extract the last reported throughput value from stdout.
Arguments:
stdout - The standard output captured from the training command.
metadata - Trial metadata (unused).
Returns:
Throughput in TFLOP/s/GPU.
Raises:
ValueError - If the throughput line is not found in stdout.