Version: v2603

intelligence.zenith_tune.presets.megatron

Preset for fine-tuning with Megatron + ms-swift.

MegatronPreset Objects

@PresetRegistry.register("megatron")
class MegatronPreset(TuningPreset)

Preset for fine-tuning with Megatron + ms-swift on multiple GPUs.

Parallelization parameters and their roles:

Input (fixed):

  • N_GPU - Specified via --args n_gpus=8. Required.
  • GBS - Specified via --args gbs=64. Required.

Output (search space):

  • TP, EP, PP, CP, ETP, MBS, recompute_granularity, recompute_method, recompute_num_layers, vit_gradient_checkpointing - All included in the search space. Which parameters are actually varied depends on the strategy (see each strategy's docstring).

Internal:

  • DP - Used only in prune constraints. Not passed to Megatron.

Not managed by ZenithTune:

  • grad_accum_steps - Computed internally by Megatron.

Relations:

  • DP = N_GPU / (TP * PP * CP * EP * ETP)
  • grad_accum_steps = GBS / (DP * MBS)
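These relations can be sketched as plain arithmetic (function names are hypothetical, not part of the ZenithTune API):

```python
def data_parallel_size(n_gpu: int, tp: int, pp: int, cp: int, ep: int, etp: int) -> int:
    # DP = N_GPU / (TP * PP * CP * EP * ETP); must divide evenly
    model_parallel = tp * pp * cp * ep * etp
    assert n_gpu % model_parallel == 0, "N_GPU must be divisible by TP*PP*CP*EP*ETP"
    return n_gpu // model_parallel

def grad_accum_steps(gbs: int, dp: int, mbs: int) -> int:
    # grad_accum_steps = GBS / (DP * MBS); must divide evenly
    assert gbs % (dp * mbs) == 0, "GBS must be divisible by DP*MBS"
    return gbs // (dp * mbs)

# Example: 8 GPUs with TP=2, PP=2 gives DP=2; GBS=64 with MBS=4 gives 8 steps
dp = data_parallel_size(8, tp=2, pp=2, cp=1, ep=1, etp=1)  # -> 2
steps = grad_accum_steps(64, dp=dp, mbs=4)                 # -> 8
```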

Prune constraints:

  1. N_GPU % (TP * PP * CP * EP * ETP) == 0 (DP is a positive integer)
  2. GBS % (DP * MBS) == 0 (grad_accum_steps is a positive integer)
  3. ETP does not exceed TP
  4. num_query_groups % TP == 0 (GQA constraint, when available)
  5. num_layers % PP == 0 (when num_layers is available)
  6. recompute_num_layers divides num_layers/PP (when num_layers is available)
  7. selective granularity and block method disabled (VLM models only)
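The seven constraints above can be checked in order, returning True as soon as one is violated. A minimal sketch (the function name and parameter-dict keys are assumptions, not the actual ZenithTune implementation):

```python
def should_prune(n_gpu, gbs, p, num_layers=None, num_query_groups=None, is_vlm=False):
    """Return True if a parallelization combination is invalid (sketch)."""
    mp = p["tp"] * p["pp"] * p["cp"] * p["ep"] * p["etp"]
    if n_gpu % mp != 0:                        # 1. DP must be a positive integer
        return True
    dp = n_gpu // mp
    if gbs % (dp * p["mbs"]) != 0:             # 2. grad_accum_steps must be a positive integer
        return True
    if p["etp"] > p["tp"]:                     # 3. ETP must not exceed TP
        return True
    if num_query_groups is not None and num_query_groups % p["tp"] != 0:
        return True                            # 4. GQA constraint
    if num_layers is not None:
        if num_layers % p["pp"] != 0:          # 5. even split across PP stages
            return True
        if (num_layers // p["pp"]) % p["recompute_num_layers"] != 0:
            return True                        # 6. recompute_num_layers divides layers per stage
    if is_vlm and (p["recompute_granularity"] == "selective"
                   or p["recompute_method"] == "block"):
        return True                            # 7. unsupported on VLM models
    return False
```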

Parameters are injected into the training command as CLI options:

--tensor-model-parallel-size, --pipeline-model-parallel-size, --context-parallel-size, --expert-model-parallel-size, --expert-tensor-parallel-size, --micro-batch-size, --global-batch-size, --recompute-granularity, --recompute-method, --recompute-num-layers, --vit-gradient-checkpointing

When model is specified, num_hidden_layers and num_key_value_heads are automatically extracted from the HuggingFace model config. These are used for GQA constraint pruning (TP must divide num_query_groups) and recompute_num_layers validation.

Examples:

Smoke test with the bundled fake training script:

zenithtune optimize --preset megatron --args n_gpus=8,gbs=16 \
  -- python examples/intelligence/zenith_tune/nocode/fake_megatron_train.py

Real training (with HF model config for MoE constraints):

zenithtune optimize --preset megatron \
  --args model=Qwen/Qwen3-Omni-30B-A3B-Instruct,n_gpus=8,gbs=16 \
  --n-trials 30 --timeout-dynamic \
  -- bash train.sh

__init__

def __init__(*,
n_gpus: int,
mbs_max: int = 8,
gbs: int,
model: str | None = None,
num_layers: int | None = None,
num_query_groups: int | None = None) -> None

Initialize the preset.

Arguments:

  • n_gpus - Total number of GPUs (>= 1, required). Used to generate tp, pp, cp, ep, etp choices (all powers of 2 up to n_gpus).
  • mbs_max - Upper bound for micro-batch size (>= 1). mbs choices are all powers of 2 from 1 up to mbs_max.
  • gbs - Global batch size (required). Passed as a fixed CLI option. MBS choices are filtered to divisors of GBS, and combinations where GBS is not divisible by DP*MBS are pruned.
  • model - HuggingFace model ID. If specified, num_layers and num_query_groups are auto-detected from config.json.
  • num_layers - Number of hidden layers. Overrides model auto-detection.
  • num_query_groups - Number of KV heads for GQA constraint. Overrides model auto-detection.

get_search_space

def get_search_space() -> dict[str, Parameter]

Return the search space for Megatron parallelization parameters.

  • tp, pp, cp, ep, etp - all powers of 2 up to n_gpus.
  • mbs - all powers of 2 up to mbs_max.
  • recompute_granularity - "full" or "selective".
  • recompute_method - "uniform" or "block".
  • recompute_num_layers - 1 through _RECOMPUTE_NUM_LAYERS_MAX.
  • vit_gradient_checkpointing - "true" or "false".
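The powers-of-2 choice generation can be sketched as follows (helper name hypothetical):

```python
def powers_of_two_up_to(limit: int) -> list[int]:
    """All powers of 2 from 1 up to and including limit."""
    choices, p = [], 1
    while p <= limit:
        choices.append(p)
        p *= 2
    return choices

# n_gpus=8 -> tp/pp/cp/ep/etp choices are [1, 2, 4, 8]
tp_choices = powers_of_two_up_to(8)
# gbs=64 filters mbs choices to divisors of GBS
mbs_choices = [m for m in powers_of_two_up_to(8) if 64 % m == 0]
```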

apply_parameters

def apply_parameters(command: str, params: dict[str, Any]) -> str

Apply Megatron parallelization parameters as CLI flags.

Arguments:

  • command - Base command string.
  • params - Parameter dict with keys from get_search_space().

Returns:

Command string with Megatron CLI flags injected/updated.
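The inject-or-update behavior for a single flag can be sketched like this (a simplified illustration, not the actual ZenithTune implementation):

```python
import re

def set_cli_flag(command: str, flag: str, value) -> str:
    """Inject a CLI flag into a command string, or update it if already present."""
    pattern = re.compile(rf"{re.escape(flag)}\s+\S+")
    replacement = f"{flag} {value}"
    if pattern.search(command):
        return pattern.sub(replacement, command)
    return f"{command} {replacement}"

cmd = "bash train.sh --micro-batch-size 1"
cmd = set_cli_flag(cmd, "--micro-batch-size", 4)            # updates existing flag
cmd = set_cli_flag(cmd, "--tensor-model-parallel-size", 2)  # appends new flag
# -> "bash train.sh --micro-batch-size 4 --tensor-model-parallel-size 2"
```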

prune

def prune(params: dict[str, Any]) -> bool

Prune invalid parallelization combinations.

Base check: n_gpus must be divisible by tp * pp * cp * ep * etp.

GBS check: GBS must be divisible by DP * MBS, where DP = n_gpus / (tp * pp * cp * ep * etp). This ensures gradient_accumulation_steps is a positive integer.

General constraint:

  • ETP must not exceed TP

When num_query_groups is available (via model or explicit num_query_groups):

  • TP must divide num_query_groups (GQA constraint)

When num_layers is available (via model or explicit num_layers):

  • recompute_num_layers must evenly divide layers per PP stage

When model is specified and the model is detected as a VLM (config has vision keys such as thinker_config, vision_config):

  • selective granularity is not supported
  • block recompute method is not supported

get_recommended_strategy

def get_recommended_strategy() -> TuningStrategy

Return MegatronHeuristicStrategy as the default strategy.