# Megatron Preset

A preset for automatically tuning distributed training parameters for multi-GPU setups with Megatron-LM / ms-swift. It searches parallelization configurations, micro-batch size, and activation recomputation settings to maximize training throughput (TFLOP/s/GPU).
For basic CLI operations (optimize / apply / analyze / timeout, etc.), see No-Code Tuning Usage.
## Quick Start
ZenithTune wraps the framework's training command directly. Specify the actual command that accepts CLI options, not a shell script like bash train.sh.
Megatron-LM example:
```shell
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 3 ...
```
ms-swift example:
```shell
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- swift sft --model Qwen/Qwen3-30B-A3B --max_steps 3 ...
```
ZenithTune automatically injects and overwrites options like --tensor-model-parallel-size, so no changes or deletions of existing options in the training command are needed.
That said, reducing iterations, epochs, or dataset size in the training command is an effective way to shorten tuning time. Note that tuning with intentionally reduced epochs may yield results that differ from production training behavior.
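The overwrite-or-append behavior described above can be sketched as follows. This is illustrative only: `inject_options` is a hypothetical helper, not ZenithTune's actual implementation.

```python
# Illustrative sketch (not ZenithTune's actual code): enforce tuner-chosen
# options on a training command, overwriting existing values and appending
# missing flags.
def inject_options(cmd, options):
    """cmd: command as a list of tokens; options: {flag: value} to enforce."""
    cmd = list(cmd)
    for flag, value in options.items():
        if flag in cmd:
            cmd[cmd.index(flag) + 1] = str(value)  # overwrite existing value
        else:
            cmd += [flag, str(value)]              # append missing option
    return cmd

cmd = ["python", "pretrain_gpt.py", "--num-layers", "30",
       "--tensor-model-parallel-size", "1"]
tuned = inject_options(cmd, {"--tensor-model-parallel-size": 4,
                             "--micro-batch-size": 2})
```

Because existing options are simply overwritten, you can leave your production parallelism flags in place when handing the command to ZenithTune.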
## Training Command Requirements
Commands passed to ZenithTune must meet the following requirements:
- Accept Megatron-style CLI options — ZenithTune injects options such as --tensor-model-parallel-size into the command. Existing options are overwritten, and missing ones are appended.
- Output throughput to stdout — in the format throughput per GPU (TFLOP/s/GPU): <value>. If multiple lines match, the last value is used. Megatron outputs this format by default, so no additional work is needed.
If you specify a wrapper script like bash train.sh, the CLI options injected by ZenithTune become arguments to the shell script and are not passed to the framework itself. Specify a command that directly accepts CLI options, such as python pretrain_gpt.py or swift sft.
## args
Pass arguments as comma-separated key=value pairs via --args.
| Argument | Description | Default |
|---|---|---|
| n_gpus | Total number of GPUs (required). Must be a power of 2 when using MegatronHeuristicStrategy | - |
| gbs | Global batch size (required) | - |
| model | HuggingFace model ID. Auto-detects layer count, KV head count, and VLM flag for pruning | None |
| num_layers | Number of layers. Overrides model auto-detection | None |
| num_query_groups | Number of KV heads. Overrides model auto-detection | None |
| mbs_max | Upper bound for micro-batch size search | 8 |
| stage1_trials | Number of Stage 1 trials for the megatron-staged-blackbox strategy | 30 |
model, num_layers, and num_query_groups are all optional. Specifying them enables pruning based on GQA constraints, layer-count constraints, and VLM model constraints, which improves search efficiency.

model auto-detects the layer count, KV head count, and VLM flag from HuggingFace. Specifying num_layers and num_query_groups instead of model still enables the GQA and layer-count constraints; however, the VLM model constraints are only available when model is specified.
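The exact pruning rules are internal to ZenithTune, but a plausible sketch, assuming the standard Megatron constraints (the KV-head count must be divisible by TP, and the layer count by PP), looks like this. `is_feasible` is a hypothetical name used only for illustration.

```python
# Hedged sketch of the kind of pruning these constraints enable: candidate
# configurations that violate a divisibility constraint can be skipped
# without running a trial at all.
def is_feasible(tp, pp, num_layers, num_query_groups):
    # GQA: KV heads must split evenly across tensor-parallel ranks.
    # Layers: each pipeline stage must get a whole number of layers.
    return num_query_groups % tp == 0 and num_layers % pp == 0

# Example: with 30 layers and 4 KV heads, TP=8 is ruled out up front.
feasible = [(tp, pp) for tp in (1, 2, 4, 8) for pp in (1, 2)
            if is_feasible(tp, pp, 30, 4)]
```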
## Search Parameters

### Parallelization Parameters
| Parameter | CLI Option | Choices | Description |
|---|---|---|---|
| TP | --tensor-model-parallel-size | 1, 2, 4, ... n_gpus | Tensor parallelism |
| PP | --pipeline-model-parallel-size | Same as above | Pipeline parallelism |
| CP | --context-parallel-size | Same as above | Context parallelism |
| EP | --expert-model-parallel-size | Same as above | Expert parallelism (MoE) |
| ETP | --expert-tensor-parallel-size | Same as above | Expert tensor parallelism (MoE, must be ≤ TP) |
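These choices interact: in Megatron, the product TP × PP × CP must divide n_gpus, with the remainder becoming the data-parallel size (DP). A hedged sketch of such an enumeration follows (not ZenithTune's actual search code; the EP/ETP constraints for MoE models are omitted for brevity):

```python
# Enumerate power-of-2 parallel layouts for a given GPU count; data
# parallelism (DP) fills whatever the other dimensions leave over.
def layouts(n_gpus):
    pow2 = [2**i for i in range(n_gpus.bit_length()) if 2**i <= n_gpus]
    for tp in pow2:
        for pp in pow2:
            for cp in pow2:
                if n_gpus % (tp * pp * cp) == 0:
                    dp = n_gpus // (tp * pp * cp)
                    yield {"tp": tp, "pp": pp, "cp": cp, "dp": dp}

configs = list(layouts(8))  # every layout satisfies tp*pp*cp*dp == 8
```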
### Computational Efficiency Parameters
| Parameter | CLI Option | Choices | Description |
|---|---|---|---|
| MBS | --micro-batch-size | 1, 2, 4, ... mbs_max (divisors of GBS only) | Micro-batch size |
| GBS | --global-batch-size | Fixed value (specified by gbs) | Global batch size |
| Activation recomputation granularity | --recompute-granularity | full, selective | Activation recomputation method |
| Activation recomputation method | --recompute-method | uniform, block | Only effective with full granularity |
| Activation recomputation layers | --recompute-num-layers | 1–8 | Only effective with full granularity |
| ViT gradient checkpointing | --vit-gradient-checkpointing | true, false | For VLM models |
With MegatronHeuristicStrategy (default), PP=1, CP=1, ETP=1, recompute_method=uniform, recompute_num_layers=1, and vit_gradient_checkpointing=true are fixed. These parameters are searched when using --strategy megatron-staged-blackbox.
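For example, the MBS candidates implied by the table above (powers of 2 up to mbs_max that also divide GBS) can be derived as follows; `mbs_choices` is illustrative, not a ZenithTune API:

```python
# Micro-batch-size candidates: powers of 2 up to mbs_max that divide the
# global batch size, so every candidate yields a whole number of micro-batches.
def mbs_choices(gbs, mbs_max=8):
    return [m for m in (2**i for i in range(mbs_max.bit_length()))
            if m <= mbs_max and gbs % m == 0]

# gbs=64 admits all of [1, 2, 4, 8]; gbs=12 drops 8 (12 % 8 != 0).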
## Search Strategies
| Strategy | CLI Option | Description |
|---|---|---|
| MegatronHeuristicStrategy | Default | Deterministic search based on domain knowledge |
| MegatronStagedBlackboxStrategy | --strategy megatron-staged-blackbox | Staged Bayesian optimization search |
```shell
# staged-blackbox strategy example
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64,stage1_trials=20 \
  --strategy megatron-staged-blackbox --n-trials 50 --timeout-dynamic \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 3
```
## Evaluator
The default evaluator is MegatronThroughputEvaluator, which extracts throughput per GPU (TFLOP/s/GPU): <value> from stdout and maximizes it. If the pattern appears multiple times, the last value is used.
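A minimal sketch of this extraction (illustrative; not MegatronThroughputEvaluator's actual code):

```python
import re

# Extract the objective value from captured stdout: find every occurrence of
# the throughput pattern and keep the last one.
PATTERN = re.compile(r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9.]+)")

def extract_throughput(stdout):
    matches = PATTERN.findall(stdout)
    return float(matches[-1]) if matches else None

log = (
    "iteration 1 | throughput per GPU (TFLOP/s/GPU): 120.5\n"
    "iteration 2 | throughput per GPU (TFLOP/s/GPU): 138.2\n"
)
value = extract_throughput(log)  # last match wins
```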
For how to change the evaluator, see What is an Evaluator?.
## Examples
```shell
# Test run (dummy script)
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 \
  -- python <aibooster-examples>/intelligence/zenith_tune/nocode/fake_megatron_train.py

# Megatron-LM (with HuggingFace model config, GQA and VLM constraints auto-enabled)
zenithtune optimize --preset megatron \
  --args model=Qwen/Qwen3-Omni-30B-A3B-Instruct,n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 3

# ms-swift
zenithtune optimize --preset megatron \
  --args n_gpus=8,gbs=64 \
  --n-trials 50 --timeout-dynamic \
  -- swift sft --model Qwen/Qwen3-30B-A3B --max_steps 3

# Analyze results
zenithtune analyze outputs/study_YYYYMMDD_HHMMSS/study.db

# Apply best parameters (--args also required)
zenithtune apply \
  --db-path outputs/study_YYYYMMDD_HHMMSS/study.db \
  --preset megatron --args n_gpus=8,gbs=64 \
  -- python pretrain_gpt.py --num-layers 30 --hidden-size 4096 --train-iters 100
```
## Training Settings for Tuning
Minimizing per-trial execution time is important during tuning. Adjust your training settings as follows:
- Reduce the number of iterations — Set to the minimum value needed for throughput measurement. However, if the iteration count is too small, warmup effects may impact results, so choose an appropriate value
- Disable validation — Validation is unnecessary for throughput measurement and significantly increases trial time
- Disable checkpoint saving — Not needed during tuning
These settings should be configured via training command options or configuration files, not in ZenithTune.
## Troubleshooting
| Problem | Solution |
|---|---|
| All trials fail | Run the command directly and verify throughput per GPU (TFLOP/s/GPU): <value> appears in stdout |
| "n_gpus must be a power of 2" | Use --strategy megatron-staged-blackbox |
| Search is truncated prematurely | Increase --n-trials |
| Trials are progressing slowly | Check if validation is enabled or if the number of iterations is too large |