Version: v2603

intelligence.zenith_tune.presets.megatron

Preset for fine-tuning with Megatron + ms-swift.

MegatronPreset Objects

@PresetRegistry.register("megatron")
class MegatronPreset(TuningPreset)

Preset for fine-tuning with Megatron + ms-swift on multiple GPUs.

Parallelization parameters and their roles:

Input (fixed):

  • N_GPU - Specified via --args n_gpus=8. Required.
  • GBS - Specified via --args gbs=64. Required.

Output (search space):

  • TP, EP, PP, CP, ETP, MBS, recompute_granularity, recompute_method, recompute_num_layers, vit_gradient_checkpointing - All included in the search space. Which parameters are actually varied depends on the strategy (see each strategy's docstring).

Internal:

  • DP - Used only in prune constraints. Not passed to Megatron.

Not managed by ZenithTune:

  • grad_accum_steps - Computed internally by Megatron.

Relations:

  • DP = N_GPU / (TP * PP * CP * EP * ETP)
  • grad_accum_steps = GBS / (DP * MBS)
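These relations can be sketched as plain arithmetic (function names are hypothetical, not part of the ZenithTune API):

```python
def data_parallel_size(n_gpu: int, tp: int, pp: int, cp: int, ep: int, etp: int) -> int:
    # DP = N_GPU / (TP * PP * CP * EP * ETP); must divide evenly
    model_parallel = tp * pp * cp * ep * etp
    assert n_gpu % model_parallel == 0, "N_GPU must be divisible by TP*PP*CP*EP*ETP"
    return n_gpu // model_parallel

def grad_accum_steps(gbs: int, dp: int, mbs: int) -> int:
    # grad_accum_steps = GBS / (DP * MBS); must divide evenly
    assert gbs % (dp * mbs) == 0, "GBS must be divisible by DP*MBS"
    return gbs // (dp * mbs)

# Example: 8 GPUs with TP=2, PP=2 gives DP=2; GBS=64 with MBS=4 gives 8 steps
dp = data_parallel_size(8, tp=2, pp=2, cp=1, ep=1, etp=1)  # -> 2
steps = grad_accum_steps(64, dp=dp, mbs=4)                 # -> 8
```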

Prune constraints:

  1. N_GPU % (TP * PP * CP * EP * ETP) == 0 (DP is a positive integer)
  2. GBS % (DP * MBS) == 0 (grad_accum_steps is a positive integer)
  3. ETP does not exceed TP
  4. num_query_groups % TP == 0 (GQA constraint, when available)
  5. num_layers % PP == 0 (when num_layers is available)
  6. recompute_num_layers divides num_layers/PP (when num_layers is available)
  7. selective granularity and block method disabled (VLM models only)
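The seven constraints above can be checked in order, returning True as soon as one is violated. A minimal sketch (the function name and parameter-dict keys are assumptions, not the actual ZenithTune implementation):

```python
def should_prune(n_gpu, gbs, p, num_layers=None, num_query_groups=None, is_vlm=False):
    """Return True if a parallelization combination is invalid (sketch)."""
    mp = p["tp"] * p["pp"] * p["cp"] * p["ep"] * p["etp"]
    if n_gpu % mp != 0:                        # 1. DP must be a positive integer
        return True
    dp = n_gpu // mp
    if gbs % (dp * p["mbs"]) != 0:             # 2. grad_accum_steps must be a positive integer
        return True
    if p["etp"] > p["tp"]:                     # 3. ETP must not exceed TP
        return True
    if num_query_groups is not None and num_query_groups % p["tp"] != 0:
        return True                            # 4. GQA constraint
    if num_layers is not None:
        if num_layers % p["pp"] != 0:          # 5. even split across PP stages
            return True
        if (num_layers // p["pp"]) % p["recompute_num_layers"] != 0:
            return True                        # 6. recompute_num_layers divides layers per stage
    if is_vlm and (p["recompute_granularity"] == "selective"
                   or p["recompute_method"] == "block"):
        return True                            # 7. unsupported on VLM models
    return False
```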

Parameters are injected into the training command as CLI options:

--tensor-model-parallel-size, --pipeline-model-parallel-size, --context-parallel-size, --expert-model-parallel-size, --expert-tensor-parallel-size, --micro-batch-size, --global-batch-size, --recompute-granularity, --recompute-method, --recompute-num-layers, --vit-gradient-checkpointing

When model is specified, num_hidden_layers and num_key_value_heads are automatically extracted from the HuggingFace model config. These are used for GQA constraint pruning (TP must divide num_query_groups) and recompute_num_layers validation.

Examples:

Smoke test with the bundled fake training script:

zenithtune optimize --preset megatron --args n_gpus=8,gbs=16 \
  -- python examples/intelligence/zenith_tune/nocode/fake_megatron_train.py

Real training (with HF model config for MoE constraints):

zenithtune optimize --preset megatron \
  --args model=Qwen/Qwen3-Omni-30B-A3B-Instruct,n_gpus=8,gbs=16 \
  --n-trials 30 --timeout-dynamic \
  -- bash train.sh

__init__

def __init__(*,
n_gpus: int,
mbs_max: int = 8,
gbs: int,
model: str | None = None,
num_layers: int | None = None,
num_query_groups: int | None = None) -> None

Initialize the preset.

Arguments:

  • n_gpus - Total number of GPUs (>= 1, required). Used to generate tp, pp, cp, ep, etp choices (all powers of 2 up to n_gpus).
  • mbs_max - Upper bound for micro-batch size (>= 1). mbs choices are all powers of 2 from 1 up to mbs_max.
  • gbs - Global batch size (required). Passed as a fixed CLI option. MBS choices are filtered to divisors of GBS, and combinations where GBS is not divisible by DP*MBS are pruned.
  • model - HuggingFace model ID. If specified, num_layers and num_query_groups are auto-detected from config.json.
  • num_layers - Number of hidden layers. Overrides model auto-detection.
  • num_query_groups - Number of KV heads for GQA constraint. Overrides model auto-detection.

get_search_space

def get_search_space() -> dict[str, Parameter]

Return the search space for Megatron parallelization parameters.

  • tp, pp, cp, ep, etp - all powers of 2 up to n_gpus.
  • mbs - all powers of 2 up to mbs_max.
  • recompute_granularity - "full" or "selective".
  • recompute_method - "uniform" or "block".
  • recompute_num_layers - 1 through _RECOMPUTE_NUM_LAYERS_MAX.
  • vit_gradient_checkpointing - "true" or "false".
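The powers-of-2 choice generation can be sketched as follows (helper name hypothetical):

```python
def powers_of_two_up_to(limit: int) -> list[int]:
    """All powers of 2 from 1 up to and including limit."""
    choices, p = [], 1
    while p <= limit:
        choices.append(p)
        p *= 2
    return choices

# n_gpus=8 -> tp/pp/cp/ep/etp choices are [1, 2, 4, 8]
tp_choices = powers_of_two_up_to(8)
# gbs=64 filters mbs choices to divisors of GBS
mbs_choices = [m for m in powers_of_two_up_to(8) if 64 % m == 0]
```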

apply_parameters

def apply_parameters(command: str, params: dict[str, Any]) -> str

Apply Megatron parallelization parameters as CLI flags.

Arguments:

  • command - Base command string.
  • params - Parameter dict with keys from get_search_space().

Returns:

Command string with Megatron CLI flags injected/updated.
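The inject-or-update behavior for a single flag can be sketched like this (a simplified illustration, not the actual ZenithTune implementation):

```python
import re

def set_cli_flag(command: str, flag: str, value) -> str:
    """Inject a CLI flag into a command string, or update it if already present."""
    pattern = re.compile(rf"{re.escape(flag)}\s+\S+")
    replacement = f"{flag} {value}"
    if pattern.search(command):
        return pattern.sub(replacement, command)
    return f"{command} {replacement}"

cmd = "bash train.sh --micro-batch-size 1"
cmd = set_cli_flag(cmd, "--micro-batch-size", 4)            # updates existing flag
cmd = set_cli_flag(cmd, "--tensor-model-parallel-size", 2)  # appends new flag
# -> "bash train.sh --micro-batch-size 4 --tensor-model-parallel-size 2"
```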

prune

def prune(params: dict[str, Any]) -> bool

Prune invalid parallelization combinations.

Base check: n_gpus must be divisible by tp * pp * cp * ep * etp.

GBS check: GBS must be divisible by DP * MBS, where DP = n_gpus / (tp * pp * cp * ep * etp). This ensures gradient_accumulation_steps is a positive integer.

General constraint:

  • ETP must not exceed TP

When num_query_groups is available (via model or explicit num_query_groups):

  • TP must divide num_query_groups (GQA constraint)

When num_layers is available (via model or explicit num_layers):

  • recompute_num_layers must evenly divide layers per PP stage

When model is specified and the model is detected as a VLM (config has vision keys such as thinker_config, vision_config):

  • selective granularity is not supported
  • block recompute method is not supported

get_recommended_strategy

def get_recommended_strategy() -> TuningStrategy

Return MegatronHeuristicStrategy as the default strategy.