# Megatron Preset

A preset for automatically tuning distributed training parameters for multi-GPU setups with Megatron-LM / ms-swift. It searches parallelization configurations, micro-batch size, and activation recomputation settings to maximize training throughput (TFLOP/s/GPU).
For basic CLI operations (optimize / apply / analyze / timeout, etc.), see No-Code Tuning Usage.
## Quick Start
ZenithTune wraps the framework's training command directly. Specify the actual command that accepts CLI options, not a shell script like bash train.sh.
Megatron-LM example:
```shell
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 3 ...
```
ms-swift example:
```shell
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- swift sft --model Qwen/Qwen3-30B-A3B --max_steps 3 ...
```
ZenithTune automatically injects and overwrites options like --tensor-model-parallel-size, so no changes or deletions of existing options in the training command are needed.
That said, reducing iterations, epochs, or dataset size in the training command is an effective way to shorten tuning time. Note that tuning with intentionally reduced epochs may yield results that differ from production training behavior.
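The overwrite-or-append behavior described above can be sketched as follows. This is illustrative only: `inject_options` is a hypothetical helper, not ZenithTune's actual implementation.

```python
# Illustrative sketch (not ZenithTune's actual code): enforce tuner-chosen
# options on a training command, overwriting existing values and appending
# missing flags.
def inject_options(cmd, options):
    """cmd: command as a list of tokens; options: {flag: value} to enforce."""
    cmd = list(cmd)
    for flag, value in options.items():
        if flag in cmd:
            cmd[cmd.index(flag) + 1] = str(value)  # overwrite existing value
        else:
            cmd += [flag, str(value)]              # append missing option
    return cmd

cmd = ["python", "pretrain_gpt.py", "--num-layers", "30",
       "--tensor-model-parallel-size", "1"]
tuned = inject_options(cmd, {"--tensor-model-parallel-size": 4,
                             "--micro-batch-size": 2})
```

Because existing options are simply overwritten, you can leave your production parallelism flags in place when handing the command to ZenithTune.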
## Training Command Requirements
Commands passed to ZenithTune must meet the following requirements:
- Accept Megatron-style CLI options — ZenithTune injects options such as --tensor-model-parallel-size into the command. Existing options are overwritten, and missing ones are appended.
- Output throughput to stdout — in the format throughput per GPU (TFLOP/s/GPU): <value>. If multiple lines match, the last value is used. Megatron outputs this format by default, so no additional work is needed.
If you specify a wrapper script like bash train.sh, the CLI options injected by ZenithTune become arguments to the shell script and are not passed to the framework itself. Specify a command that directly accepts CLI options, such as python pretrain_gpt.py or swift sft.
## args
Pass arguments as comma-separated key=value pairs via --args.
| Argument | Description | Default |
|---|---|---|
| n_gpus | Total number of GPUs (required). Must be a power of 2 when using MegatronHeuristicStrategy | - |
| gbs | Global batch size (required) | - |
| model | HuggingFace model ID. Auto-detects layer count, KV head count, and VLM flag for pruning | None |
| num_layers | Number of layers. Overrides model auto-detection | None |
| num_query_groups | Number of KV heads. Overrides model auto-detection | None |
| mbs_max | Upper bound for micro-batch size search | 8 |
| stage1_trials | Number of Stage 1 trials for the megatron-staged-blackbox strategy | 30 |
model, num_layers, and num_query_groups are all optional. Specifying them enables pruning based on GQA constraints, layer-count constraints, and VLM model constraints, which improves search efficiency.

model auto-detects the layer count, KV head count, and VLM flag from HuggingFace. Specifying num_layers and num_query_groups instead of model still enables the GQA and layer-count constraints; however, the VLM model constraints are only available when model is specified.
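The exact pruning rules are internal to ZenithTune, but a plausible sketch, assuming the standard Megatron constraints (the KV-head count must be divisible by TP, and the layer count by PP), looks like this. `is_feasible` is a hypothetical name used only for illustration.

```python
# Hedged sketch of the kind of pruning these constraints enable: candidate
# configurations that violate a divisibility constraint can be skipped
# without running a trial at all.
def is_feasible(tp, pp, num_layers, num_query_groups):
    # GQA: KV heads must split evenly across tensor-parallel ranks.
    # Layers: each pipeline stage must get a whole number of layers.
    return num_query_groups % tp == 0 and num_layers % pp == 0

# Example: with 30 layers and 4 KV heads, TP=8 is ruled out up front.
feasible = [(tp, pp) for tp in (1, 2, 4, 8) for pp in (1, 2)
            if is_feasible(tp, pp, 30, 4)]
```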
## Search Parameters

### Parallelization Parameters
| Parameter | CLI Option | Choices | Description |
|---|---|---|---|
| TP | --tensor-model-parallel-size | 1, 2, 4, ... n_gpus | Tensor parallelism |
| PP | --pipeline-model-parallel-size | Same as above | Pipeline parallelism |
| CP | --context-parallel-size | Same as above | Context parallelism |
| EP | --expert-model-parallel-size | Same as above | Expert parallelism (MoE) |
| ETP | --expert-tensor-parallel-size | Same as above | Expert tensor parallelism (MoE, must be ≤ TP) |
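These choices interact: in Megatron, the product TP × PP × CP must divide n_gpus, with the remainder becoming the data-parallel size (DP). A hedged sketch of such an enumeration follows (not ZenithTune's actual search code; the EP/ETP constraints for MoE models are omitted for brevity):

```python
# Enumerate power-of-2 parallel layouts for a given GPU count; data
# parallelism (DP) fills whatever the other dimensions leave over.
def layouts(n_gpus):
    pow2 = [2**i for i in range(n_gpus.bit_length()) if 2**i <= n_gpus]
    for tp in pow2:
        for pp in pow2:
            for cp in pow2:
                if n_gpus % (tp * pp * cp) == 0:
                    dp = n_gpus // (tp * pp * cp)
                    yield {"tp": tp, "pp": pp, "cp": cp, "dp": dp}

configs = list(layouts(8))  # every layout satisfies tp*pp*cp*dp == 8
```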
### Computational Efficiency Parameters
| Parameter | CLI Option | Choices | Description |
|---|---|---|---|
| MBS | --micro-batch-size | 1, 2, 4, ... mbs_max (divisors of GBS only) | Micro-batch size |
| GBS | --global-batch-size | Fixed value (specified by gbs) | Global batch size |
| Activation recomputation granularity | --recompute-granularity | full, selective | Activation recomputation method |
| Activation recomputation method | --recompute-method | uniform, block | Only effective with full granularity |
| Activation recomputation layers | --recompute-num-layers | 1–8 | Only effective with full granularity |
| ViT gradient checkpointing | --vit-gradient-checkpointing | true, false | For VLM models |
With MegatronHeuristicStrategy (default), PP=1, CP=1, ETP=1, recompute_method=uniform, recompute_num_layers=1, and vit_gradient_checkpointing=true are fixed. These parameters are searched when using --strategy megatron-staged-blackbox.
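For example, the MBS candidates implied by the table above (powers of 2 up to mbs_max that also divide GBS) can be derived as follows; `mbs_choices` is illustrative, not a ZenithTune API:

```python
# Micro-batch-size candidates: powers of 2 up to mbs_max that divide the
# global batch size, so every candidate yields a whole number of micro-batches.
def mbs_choices(gbs, mbs_max=8):
    return [m for m in (2**i for i in range(mbs_max.bit_length()))
            if m <= mbs_max and gbs % m == 0]

# gbs=64 admits all of [1, 2, 4, 8]; gbs=12 drops 8 (12 % 8 != 0).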
## Search Strategies
| Strategy | CLI Option | Description |
|---|---|---|
| MegatronHeuristicStrategy | Default | Deterministic search based on domain knowledge |
| MegatronStagedBlackboxStrategy | --strategy megatron-staged-blackbox | Staged Bayesian optimization search |
```shell
# staged-blackbox strategy example
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64,stage1_trials=20 \
  --strategy megatron-staged-blackbox --n-trials 50 --timeout-dynamic \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 3
```
## Evaluator
The default evaluator is MegatronThroughputEvaluator, which extracts throughput per GPU (TFLOP/s/GPU): <value> from stdout and maximizes it. If the pattern appears multiple times, the last value is used.
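A minimal sketch of this extraction (illustrative; not MegatronThroughputEvaluator's actual code):

```python
import re

# Extract the objective value from captured stdout: find every occurrence of
# the throughput pattern and keep the last one.
PATTERN = re.compile(r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9.]+)")

def extract_throughput(stdout):
    matches = PATTERN.findall(stdout)
    return float(matches[-1]) if matches else None

log = (
    "iteration 1 | throughput per GPU (TFLOP/s/GPU): 120.5\n"
    "iteration 2 | throughput per GPU (TFLOP/s/GPU): 138.2\n"
)
value = extract_throughput(log)  # last match wins
```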
For how to change the evaluator, see What is an Evaluator?.
## Examples
```shell
# Test run (dummy script)
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 \
  -- python <aibooster-examples>/intelligence/zenith_tune/nocode/fake_megatron_train.py

# Megatron-LM (with HuggingFace model config, GQA and VLM constraints auto-enabled)
zenithtune optimize --preset megatron \
  --args model=Qwen/Qwen3-Omni-30B-A3B-Instruct,n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 3

# ms-swift
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- swift sft --model Qwen/Qwen3-30B-A3B --max_steps 3

# Analyze results
zenithtune analyze outputs/study_YYYYMMDD_HHMMSS/study.db

# Apply best parameters (--args also required)
zenithtune apply \
  --db-path outputs/study_YYYYMMDD_HHMMSS/study.db \
  --preset megatron --args n_gpus=8,gbs=64 \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 100
```
## Training Settings for Tuning
Minimizing per-trial execution time is important during tuning. Adjust your training settings as follows:
- Reduce the number of iterations — Set to the minimum value needed for throughput measurement. However, if the iteration count is too small, warmup effects may impact results, so choose an appropriate value
- Disable validation — Validation is unnecessary for throughput measurement and significantly increases trial time
- Disable checkpoint saving — Not needed during tuning
These settings should be configured via training command options or configuration files, not in ZenithTune.
## Troubleshooting
| Problem | Solution |
|---|---|
| All trials fail | Run the command directly and verify throughput per GPU (TFLOP/s/GPU): <value> appears in stdout |
| "n_gpus must be a power of 2" | Use --strategy megatron-staged-blackbox |
| Search is truncated prematurely | Increase --n-trials |
| Trials are progressing slowly | Check if validation is enabled or if the number of iterations is too large |