intelligence.zenith_tune.presets.megatron
Preset for fine-tuning with Megatron + ms-swift.
MegatronPreset Objects
@PresetRegistry.register("megatron")
class MegatronPreset(TuningPreset)
Preset for fine-tuning with Megatron + ms-swift on multiple GPUs.
Parallelization parameters and their roles:
Input (fixed):
- N_GPU - Specified via --args n_gpus=8. Required.
- GBS - Specified via --args gbs=64. Required.
Output (search space):
- TP, EP, PP, CP, ETP, MBS, recompute_granularity, recompute_method, recompute_num_layers, vit_gradient_checkpointing - All included in the search space. Which parameters are actually varied depends on the strategy (see each strategy's docstring).
Internal:
- DP - Used only in prune constraints. Not passed to Megatron.
Not managed by ZenithTune:
- grad_accum_steps - Computed internally by Megatron.
Relations:
- DP = N_GPU / (TP * PP * CP * EP * ETP)
- grad_accum_steps = GBS / (DP * MBS)
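The two relations can be sketched as a small helper (the function name is hypothetical, not part of the preset's API):

```python
def derive_batch_params(n_gpu: int, gbs: int, tp: int, pp: int,
                        cp: int, ep: int, etp: int, mbs: int) -> tuple[int, int]:
    """Derive DP and grad_accum_steps from the fixed inputs and the
    searched parallelization parameters (illustrative sketch)."""
    model_parallel = tp * pp * cp * ep * etp
    dp = n_gpu // model_parallel          # data-parallel size, internal only
    grad_accum_steps = gbs // (dp * mbs)  # computed internally by Megatron
    return dp, grad_accum_steps

# Example: 8 GPUs, GBS=64, TP=2, all other parallel sizes 1, MBS=4
# DP = 8 / 2 = 4 and grad_accum_steps = 64 / (4 * 4) = 4
```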
Prune constraints:
- N_GPU % (TP * PP * CP * EP * ETP) == 0 (DP is a positive integer)
- GBS % (DP * MBS) == 0 (grad_accum_steps is a positive integer)
- ETP does not exceed TP
- num_query_groups % TP == 0 (GQA constraint, when available)
- num_layers % PP == 0 (when num_layers is available)
- recompute_num_layers divides num_layers/PP (when num_layers is available)
- selective granularity and block method disabled (VLM models only)
Parameters are injected into the training command as CLI options:
--tensor-model-parallel-size, --pipeline-model-parallel-size, --context-parallel-size, --expert-model-parallel-size, --expert-tensor-parallel-size, --micro-batch-size, --global-batch-size, --recompute-granularity, --recompute-method, --recompute-num-layers, --vit-gradient-checkpointing
When model is specified, num_hidden_layers and num_key_value_heads are
automatically extracted from the HuggingFace model config. These are used
for GQA constraint pruning (TP must divide num_query_groups) and
recompute_num_layers validation.
Examples:
Smoke test with the bundled fake training script:
zenithtune optimize --preset megatron --args n_gpus=8,gbs=16
-- python examples/intelligence/zenith_tune/nocode/fake_megatron_train.py
Real training (with HF model config for MoE constraints):
zenithtune optimize --preset megatron
--args model=Qwen/Qwen3-Omni-30B-A3B-Instruct,n_gpus=8,gbs=16
--n-trials 30 --timeout-dynamic
-- bash train.sh
__init__
def __init__(*,
n_gpus: int,
mbs_max: int = 8,
gbs: int,
model: str | None = None,
num_layers: int | None = None,
num_query_groups: int | None = None) -> None
Initialize the preset.
Arguments:
- n_gpus - Total number of GPUs (>= 1, required). Used to generate tp, pp, cp, ep, etp choices (all powers of 2 up to n_gpus).
- mbs_max - Upper bound for micro-batch size (>= 1). mbs choices are all powers of 2 from 1 up to mbs_max.
- gbs - Global batch size (required). Passed as a fixed CLI option. MBS choices are filtered to divisors of GBS, and combinations where GBS is not divisible by DP*MBS are pruned.
- model - HuggingFace model ID. If specified, num_layers and num_query_groups are auto-detected from config.json.
- num_layers - Number of hidden layers. Overrides model auto-detection.
- num_query_groups - Number of KV heads for GQA constraint. Overrides model auto-detection.
get_search_space
def get_search_space() -> dict[str, Parameter]
Return the search space for Megatron parallelization parameters.
- tp, pp, cp, ep, etp choices: all powers of 2 up to n_gpus.
- mbs choices: all powers of 2 up to mbs_max.
- recompute_granularity choices: "full" and "selective".
- recompute_method choices: "uniform" and "block".
- recompute_num_layers choices: 1 through _RECOMPUTE_NUM_LAYERS_MAX.
- vit_gradient_checkpointing choices: "true" and "false".
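The powers-of-2 choice generation can be sketched as follows (the helper name is hypothetical):

```python
def powers_of_two_up_to(limit: int) -> list[int]:
    """All powers of 2 from 1 up to limit, inclusive when limit is itself
    a power of 2 (illustrative sketch of the choice generation)."""
    choices, value = [], 1
    while value <= limit:
        choices.append(value)
        value *= 2
    return choices

# With n_gpus=8, the tp/pp/cp/ep/etp choices are [1, 2, 4, 8]
```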
apply_parameters
def apply_parameters(command: str, params: dict[str, Any]) -> str
Apply Megatron parallelization parameters as CLI flags.
Arguments:
- command - Base command string.
- params - Parameter dict with keys from get_search_space().
Returns:
Command string with Megatron CLI flags injected/updated.
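The inject-or-update behavior can be sketched with a simple regex-based helper. This is a simplified stand-in, not the preset's actual implementation; the regex approach and the function name are assumptions:

```python
import re

def set_flag(command: str, flag: str, value) -> str:
    """Append a CLI flag to the command, or update it in place if it is
    already present (illustrative sketch of apply_parameters' behavior)."""
    pattern = rf"{re.escape(flag)}(\s+|=)\S+"
    replacement = f"{flag} {value}"
    if re.search(pattern, command):
        return re.sub(pattern, replacement, command)
    return f"{command} {replacement}"

cmd = set_flag("bash train.sh", "--tensor-model-parallel-size", 2)
# injected: "bash train.sh --tensor-model-parallel-size 2"
cmd = set_flag(cmd, "--tensor-model-parallel-size", 4)
# updated in place: "bash train.sh --tensor-model-parallel-size 4"
```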
prune
def prune(params: dict[str, Any]) -> bool
Prune invalid parallelization combinations.
Base check: n_gpus must be divisible by tp * pp * cp * ep * etp.
GBS check: GBS must be divisible by DP * MBS, where DP = n_gpus / (tp * pp * cp * ep * etp). This ensures gradient_accumulation_steps is a positive integer.
General constraint:
- ETP must not exceed TP
When num_query_groups is available (via model or explicit
num_query_groups):
- TP must divide num_query_groups (GQA constraint)
When num_layers is available (via model or explicit
num_layers):
- recompute_num_layers must evenly divide layers per PP stage
When model is specified and the model is detected as a VLM
(config has vision keys such as thinker_config, vision_config):
- selective granularity is not supported
- block recompute method is not supported
get_recommended_strategy
def get_recommended_strategy() -> TuningStrategy
Return MegatronHeuristicStrategy as the default strategy.