Version: v2506

# List of Megatron-LM Acceleration Options

## network_size_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| num-layers | Number of transformer layers | None | Model dependent, int |
| encoder-num-layers | Number of transformer encoder layers | None | Model dependent, int |
| decoder-num-layers | Number of transformer decoder layers | None | Model dependent, int |
| hidden-size | Hidden size of transformer | None | Model dependent, int |
| ffn-hidden-size | Hidden size of transformer FFN | 4 * hidden-size | Model dependent, int |
| num-attention-heads | Number of transformer attention heads | None | Model dependent, int |
| kv-channels | Projection weights dimension for multi-head attention | hidden-size // num-attention-heads | Model dependent, int |
| group-query-attention | Use group-query attention | Disabled | Boolean |
| num-query-groups | Number of query groups for group-query attention | 1 | int |
| max-position-embeddings | Position embedding size | None | Model dependent, int |
| position-embedding-type | Deprecated. Position embedding type | "learned_absolute" | "learned_absolute" or "rope" or "none" |
| use-rotary-position-embeddings | Use rotary position embeddings | Disabled | Boolean |
| rotary-base | Theta value for rotary position embeddings | 10000 | int |
| rotary-percent | Usage percentage of rotary dimension | 1.0 (100%) | float |
| rotary-interleaved | Use interleaved rotary embedding | Disabled | Boolean |
| rotary-seq-len-interpolation-factor | Sequence length interpolation factor for rotary embeddings | None | int |
| no-position-embedding | Do not use position embedding | Disabled | Boolean |
| make-vocab-size-divisible-by | Pad the vocabulary size so that it is divisible by this value, for computational efficiency | 128 | int |
| normalization | Layer normalization type | "layernorm" | "layernorm" or "rmsnorm" |
| norm-epsilon | Layer norm epsilon | 1e-5 | float |
| apply-layernorm-1p | Adjust layer norm to be centered around 0 to improve numerical stability | Disabled | Boolean |
| apply-residual-connection-post-layernorm | Use the original BERT residual connection order | Disabled | Boolean |
| openai-gelu | Use OpenAI GeLU (deprecated; kept for backward compatibility) | Disabled | Boolean |
| squared-relu | Use squared ReLU activation instead of the default GeLU | Disabled | Boolean |
| swiglu | Use gated linear units and SiLU activation instead of the default GeLU | Disabled | Boolean |
| onnx-safe | Avoid known issues with the ONNX exporter | false | bool |
| bert-no-binary-head | Disable BERT binary head | Disabled | Boolean |
| untie-embeddings-and-output-weights | Untie embeddings and output weights | Disabled | Boolean |
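
As a concrete illustration, the flags below sketch a Llama-style decoder. All sizes are illustrative, and the shell-array idiom and `pretrain_gpt.py` entry point are assumptions about a typical launch script, not part of this reference.

```bash
# Minimal sketch of network-size flags for a Llama-style decoder.
# Every value here is an example, not a recommendation.
NETWORK_SIZE_ARGS=(
    --num-layers 32
    --hidden-size 4096
    --ffn-hidden-size 11008
    --num-attention-heads 32
    --group-query-attention
    --num-query-groups 8
    --max-position-embeddings 4096
    --position-embedding-type rope
    --normalization rmsnorm
    --swiglu
    --untie-embeddings-and-output-weights
)
```

An array like this would then be expanded into the launch command, e.g. `torchrun --nproc-per-node 8 pretrain_gpt.py "${NETWORK_SIZE_ARGS[@]}" ...`, together with the argument groups from the sections below.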

## regularization_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| attention-dropout | Dropout rate after attention | 0.1 | float |
| hidden-dropout | Dropout rate for the transformer hidden state | 0.1 | float |
| weight-decay | Weight decay coefficient for L2 regularization | 0.01 | float |
| start-weight-decay | Initial weight decay coefficient for L2 regularization | None | float |
| end-weight-decay | Final weight decay coefficient for L2 regularization | None | float |
| weight-decay-incr-style | Weight decay increment function | "constant" | "constant" or "linear" or "cosine" |
| clip-grad | Gradient clipping based on global L2 norm | 1.0 | float |
| adam-beta1 | Adam beta1 coefficient | 0.9 | float |
| adam-beta2 | Adam beta2 coefficient | 0.999 | float |
| adam-eps | Term added to the denominator to improve numerical stability | 1e-8 | float |
| sgd-momentum | Momentum factor for SGD | 0.9 | float |
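
A typical combination for a large decoder-only model might look as follows. This is a minimal sketch: the values are illustrative, and adam-beta2 0.95 is a common choice for LLM pretraining rather than a requirement.

```bash
# Minimal sketch of regularization and optimizer-coefficient flags.
REGULARIZATION_ARGS=(
    --attention-dropout 0.0    # dropout is often disabled for LLM pretraining
    --hidden-dropout 0.0
    --weight-decay 0.1
    --clip-grad 1.0
    --adam-beta1 0.9
    --adam-beta2 0.95
    --adam-eps 1e-8
)
```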

## training_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| micro-batch-size | Micro batch size | None | int |
| global-batch-size | Global batch size | None (micro-batch-size × data-parallel-size) | int |
| rampup-batch-size | Linearly increase the batch size per iteration from a start batch size up to global-batch-size. Format: `--rampup-batch-size <start batch size> <batch size increment> <ramp-up samples>` | None | |
| decrease-batch-size-if-needed | Decrease batch_size if it is not divisible by micro_batch_size × dp_size | Disabled | Boolean |
| recompute-activations | Enable activation recomputation for training large models, sequences, and batch sizes | Disabled | Boolean |
| recompute-granularity | Granularity of recompute-activations: "full" recomputes the entire transformer layer, "selective" only the core attention part | None | "full" or "selective" |
| no-check-for-nan-in-loss-and-grad | Do not check for NaN in loss and gradients | Disabled | Boolean |
| distribute-saved-activations | Distribute recomputed activations across model parallel groups | Disabled | Boolean |
| recompute-method | Recomputation method: "uniform" divides the transformer layers uniformly and recomputes each chunk, "block" recomputes per pipeline stage | None | "uniform" or "block" |
| recompute-num-layers | Number of layers to recompute using the method specified by recompute-method | 1 | int |
| no-clone-scatter-output-in-embedding | Do not clone the output of scatter in the embedding layer to GC the original tensor | Disabled | Boolean |
| profile | Enable nsys profiling | Disabled | Boolean |
| profile-step-start | Starting global step to profile | 10 | int |
| profile-step-end | Ending global step to profile | 12 | int |
| profile-ranks | Global ranks to profile | [0] | array of int |
| tp-comm-overlap | Overlap TP communication with GEMM kernels | Disabled | Boolean |
| tp-comm-overlap-cfg | Config file for tp-comm-overlap | None | str |
| disable-tp-comm-overlap-ag | Disable overlap of GEMM and All-Gather in the pipeline | Disabled | Boolean |
| disable-tp-comm-overlap-rs | Disable overlap of GEMM and Reduce-Scatter in the pipeline | Disabled | Boolean |
| tp-comm-overlap-rs-dgrad | Overlap Reduce-Scatter with dgrad GEMM | Disabled | Boolean |
| disable-tp-comm-bulk-dgrad | Disable overlap of All-Gather and bprop activation gradient GEMM | Disabled | Boolean |
| disable-tp-comm-bulk-wgrad | Disable overlap of Reduce-Scatter and bprop weight gradient GEMM | Disabled | Boolean |
| use-cpu-initialization | Initialize weights on CPU to eliminate differences caused by TP initialization | Disabled | Boolean |
| empty-unused-memory-level | Call `torch.cuda.empty_cache()` each iteration | 0 | 0 (off) or 1 (moderate) or 2 (aggressive) |
| deterministic-mode | Run with deterministic behavior for debugging | Disabled | Boolean |
| check-weight-hash-across-dp-replicas-interval | Interval at which to check weight hashes across DP replicas | None | int |
| calculate-per-token-loss | Calculate the cross-entropy loss over the non-padded tokens in the global batch | Disabled | Boolean |
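
Note that the global batch size, micro batch size, and data-parallel size together determine the number of gradient-accumulation steps: with micro-batch-size 1 and a data-parallel size of 64, a global-batch-size of 512 implies 512 / (1 × 64) = 8 accumulation steps per iteration. A minimal sketch with illustrative values:

```bash
# Minimal sketch of batch-size and recomputation flags.
TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 512              # 8 accumulation steps at DP size 64
    --recompute-granularity selective    # recompute only core attention
    --tp-comm-overlap                    # overlap TP communication with GEMMs
)
```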

## deprecated

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| checkpoint-activations | Equivalent to recompute-activations | Disabled | Boolean |
| train-iters | Number of training iterations | None | int (mutually exclusive with train-samples) |
| train-samples | Number of training samples | None | int (mutually exclusive with train-iters) |
| log-interval | Iteration interval for logging output | 100 | int |
| exit-interval | Iteration at which to exit | None | int |
| exit-duration-in-mins | Exit duration (minutes) | None | int |
| exit-signal-handler | Save a checkpoint and exit after receiving SIGTERM | Disabled | Boolean |
| tensorboard-dir | Directory for TensorBoard logs | None | str |
| no-masked-softmax-fusion | Disable fusion of query_key_value scaling, masking, and softmax | Disabled | Boolean |
| no-bias-gelu-fusion | Disable fusion of bias and GeLU | Disabled | Boolean |
| no-bias-swiglu-fusion | Disable fusion of bias and SwiGLU | Disabled | Boolean |
| no-bias-dropout-fusion | Disable fusion of bias and dropout | Disabled | Boolean |
| no-bias-rope-fusion | Disable RoPE fusion. Only supported by megatron-core. | Disabled | Boolean |
| cross-entropy-loss-fusion | Fuse the cross-entropy loss calculation | Disabled | Boolean |
| use-flash-attn | Use FlashAttention | Disabled | Boolean |
| disable-bias-linear | Disable bias in linear layers | Disabled | Boolean |
| add-qkv-bias | Enable bias in the QKV linear layers | Disabled | Boolean |
| optimizer | Optimizer function | "adam" | "adam" or "sgd" |
| dataloader-type | Single-pass or multi-pass dataloader | None | "single" or "cyclic" or "external" |
| no-async-tensor-model-parallel-allreduce | This option is ignored | Disabled | Boolean |
| no-persist-layer-norm | Disable the persistent fused layer norm kernel. Supported only for specific hidden sizes. | Disabled | Boolean |
| sequence-parallel | Enable Megatron-LM's sequence parallel optimization | Disabled | Boolean |
| no-gradient-accumulation-fusion | Disable fusing gradient accumulation into the weight gradient computation of linear layers | Disabled | Boolean |
| use-mcore-models | Use megatron-core. Deprecated, as mcore is now used by default. | Disabled | Boolean |
| use-legacy-models | Use megatron-legacy | Disabled | Boolean |
| manual-gc | Do not use the threshold-based garbage collector | Disabled | Boolean |
| manual-gc-interval | Training step interval at which to trigger garbage collection | 0 | int (0 does not trigger gc) |
| no-manual-gc-eval | Do not perform manual gc during evaluation | Disabled | Boolean |
| disable-tp-comm-split-ag | Do not overlap All-Gather and fprop GEMM | Disabled | Boolean |
| disable-tp-comm-split-rs | Do not overlap Reduce-Scatter and fprop GEMM | Disabled | Boolean |
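
Several options in this table remain widely used in practice. A minimal sketch combining a few of them, with illustrative values:

```bash
# Minimal sketch of fusion/attention flags plus a training horizon.
EXTRA_ARGS=(
    --use-flash-attn
    --sequence-parallel
    --disable-bias-linear
    --train-iters 50000    # mutually exclusive with --train-samples
)
```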

## initialization_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| seed | Random seed used in python, numpy, pytorch, and cuda | 1234 | int |
| data-parallel-random-init | Randomly initialize parameters across data parallel ranks | Disabled | Boolean |
| init-method-std | Standard deviation of the zero-mean normal distribution used for weight initialization | 0.02 | float |
| init-method-xavier-uniform | Enable Xavier uniform parameter initialization | Disabled | Boolean |

## learning_rate_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| lr | Initial learning rate, before warmup and decay are applied | None | float |
| lr-decay-style | Learning rate decay function | "linear" | "constant" or "linear" or "cosine" or "inverse-square-root" or "WSD" |
| lr-wsd-decay-style | Decay style for the WSD decay phase | "exponential" | "exponential" or "linear" or "cosine" |
| lr-decay-iters | Number of iterations over which to decay the learning rate | None (= train-iters) | int |
| lr-decay-samples | Number of samples over which to decay the learning rate | None (= train-samples) | int |
| lr-wsd-decay-samples | Number of decay samples for WSD | None | int |
| lr-wsd-decay-iters | Number of decay iterations for WSD | None | int |
| lr-warmup-fraction | Fraction of lr-warmup-(iters/samples) | None | float |
| lr-warmup-iters | Number of iterations over which to linearly warm up the learning rate | 0 | int |
| lr-warmup-samples | Number of samples over which to linearly warm up the learning rate | 0 | int |
| lr-warmup-init | Initial value for lr warmup | 0.0 | float |
| warmup | Old lr warmup argument; use lr-warmup-* instead | None | int |
| min-lr | Minimum learning rate | 0.0 | float |
| override-opt_param-scheduler | Reset all lr scheduler values, ignoring the checkpoint | Disabled | Boolean |
| use-checkpoint-opt_param-scheduler | Use the lr scheduler values from the checkpoint | Disabled | Boolean |
| decoupled-lr | Separate learning rate for the input and output layers | None | float |
| decoupled-min-lr | Separate minimum learning rate for the input and output layers | None | float |
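
For example, a cosine schedule with linear warmup could be expressed as below (a minimal sketch; the values are illustrative):

```bash
# Minimal sketch: warm up for 2000 iterations, then decay the learning
# rate from 3.0e-4 to the 3.0e-5 floor over 50000 iterations.
LR_ARGS=(
    --lr 3.0e-4
    --min-lr 3.0e-5
    --lr-decay-style cosine
    --lr-warmup-iters 2000
    --lr-decay-iters 50000
)
```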

## checkpointing_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| save | Path to save checkpoints | None | str |
| save-interval | Iteration interval for saving checkpoints | None | int |
| no-save-optim | Do not save the optimizer state | Disabled | Boolean |
| no-save-rng | Do not save the rng state | Disabled | Boolean |
| load | Path to load checkpoints from | None | str |
| no-load-optim | Do not load the optimizer state from the checkpoint | Disabled | Boolean |
| no-load-rng | Do not load the rng state from the checkpoint | Disabled | Boolean |
| non-persistent-save-interval | Iteration interval for non-persistent saves | None | int |
| non-persistent-ckpt-type | Type of non-persistent model checkpoint | None (do not use non-persistent checkpointing) | "global" (Lustre) or "local" (SSD/ramdisk per rank, TBD) or "in_memory" (special method to prevent serialization, TBD) |
| non-persistent-global-ckpt-dir | Directory for global non-persistent model checkpoints | None | str |
| finetune | Set the iteration to 0 and do not load the optimizer or rng state from the checkpoint, for finetuning | Disabled | Boolean |
| pretrained-checkpoint | Checkpoint directory of a pretrained model, for finetuning | None | str |
| ckpt-step | Step of the checkpoint to load | None | int |
| no-initialization | Skip initialization during model construction to reduce startup time when loading from a checkpoint | Disabled | Boolean |
| use-checkpoint-args | Overwrite the current args with args from the checkpoint | Disabled | Boolean |
| exit-on-missing-checkpoint | Exit instead of training with random parameters if checkpoint loading fails | Disabled | Boolean |
| use-dist-ckpt | Use the distributed checkpoint format | Disabled | Boolean |
| auto-detect-ckpt-format | Automatically detect the checkpoint format | Disabled | Boolean |
| dist-ckpt-format | Distributed checkpoint format | "torch_dist" | "zarr" or "torch_dist" |
| ckpt-fully-parallel-save | Deprecated, as it is now the default | Disabled | Boolean |
| no-ckpt-fully-parallel-save | Do not perform fully parallel checkpoint saves | Disabled | Boolean |
| async-save | Save checkpoints asynchronously | Disabled | Boolean |
| ckpt-fully-parallel-load | Load checkpoints saved with ckpt-fully-parallel-save | Disabled | Boolean |
| ckpt-assume-constant-structure | Assume that the model and optimizer structure are constant | Disabled | Boolean |
| dist-ckpt-strictness | How to handle key mismatches when loading a distributed checkpoint | "assume_ok_unexpected" | StrictHandling values |
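
A minimal sketch of a common save/load setup; the paths are placeholders:

```bash
# Minimal sketch of checkpointing flags.
CHECKPOINT_ARGS=(
    --save /path/to/checkpoints
    --load /path/to/checkpoints
    --save-interval 1000
    --use-dist-ckpt                  # distributed checkpoint format
    --dist-ckpt-format torch_dist
    --async-save                     # write checkpoints asynchronously
)
```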

## mixed_precision_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| fp16 | Run the model in fp16 | Disabled | Boolean |
| bf16 | Run the model in bfloat16 | Disabled | Boolean |
| loss-scale | Static loss scaling. A power of 2 can be expected to improve fp16 convergence. | None (dynamic loss scaling) | float |
| initial-loss-scale | Initial value for dynamic loss scaling | 2**32 | float |
| min-loss-scale | Minimum value for dynamic loss scaling | 1.0 | float |
| loss-scale-window | Up/down window for the dynamic scale | 1000 | float |
| hysteresis | Hysteresis for dynamic loss scaling | 2 | int |
| fp32-residual-connection | Move residual connections to fp32 | Disabled | Boolean |
| apply-query-key-layer-scaling | Scale Q * K^T by 1 / layer-number | Disabled | Boolean |
| attention-softmax-in-fp32 | Calculate attention masking and softmax in fp32 when --no-query-key-layer-scaling is set | Disabled | Boolean |
| accumulate-allreduce-grads-in-fp32 | Perform gradient accumulation and all-reduce in fp32 | Disabled | Boolean |
| fp16-lm-cross-entropy | Compute the unreduced cross-entropy loss of the lm head in fp16 | Disabled | Boolean |
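
bf16 training needs no loss scaling, while fp16 normally relies on the dynamic loss-scaling options above. A minimal sketch of each variant, with illustrative values:

```bash
# Minimal sketch: bf16 with fp32 gradient accumulation/all-reduce.
MIXED_PRECISION_ARGS=(
    --bf16
    --accumulate-allreduce-grads-in-fp32
)
# An fp16 run might instead use dynamic loss scaling, e.g.:
#   --fp16 --initial-loss-scale 65536 --min-loss-scale 1.0
```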

## distributed_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| tensor-model-parallel-size | Tensor parallelism size (TP size) | 1 | int |
| pipeline-model-parallel-size | Pipeline parallelism size (PP size) | 1 | int |
| encoder-pipeline-model-parallel-size | PP size for the encoder | None | int |
| pipeline-model-parallel-split-rank | Rank at which to split between the encoder and decoder. Deprecated: use encoder-pipeline-model-parallel-size. | None | int |
| model-parallel-size | Old option; use tensor-model-parallel-size | None | int |
| num-layers-per-virtual-pipeline-stage | Number of layers per virtual pipeline stage | None | int |
| no-overlap-p2p-communication | Do not overlap pipeline-parallel communication with forward/backward chunks | Disabled | Boolean |
| distributed-backend | Backend for distributed training | "nccl" | "nccl" or "gloo" |
| distributed-timeout-minutes | Timeout in minutes for torch.distributed | 10 | int |
| overlap-grad-reduce | Overlap the DDP grad reduce | Disabled | Boolean |
| defer-embedding-wgrad-compute | Defer the vocabulary projection linear layer weight gradient computation until the pipeline flush | Disabled | Boolean |
| wgrad-deferral-limit | Number of microbatches to defer with defer-embedding-wgrad-compute | 0 | int |
| no-delay-grad-reduce | Do not delay/synchronize grad reduction in all PP stages except the first | Disabled | Boolean |
| ddp-bucket-size | Bucket size for data-parallel communication | None | int |
| ddp-average-in-collective | Compute the average inside the collective communication | Disabled | Boolean |
| overlap-param-gather | Overlap the parameter all-gather | Disabled | Boolean |
| delay-param-gather | Delay/synchronize the parameter all-gather in all PP stages except the first | Disabled | Boolean |
| no-scatter-gather-tensors-in-pipeline | Do not use scatter/gather to optimize tensor communication in the pipeline | Disabled | Boolean |
| use-ring-exchange-p2p | Use a custom-built ring exchange for p2p communications | Disabled | Boolean |
| local-rank | Local rank | None | int |
| lazy-mpu-init | Skip DDP initialization during initialize_megatron() and return an alternative function instead. For external DDP managers. | Disabled | Boolean |
| standalone-embedding-stage | Place the input embedding layer in its own pipeline stage | Disabled | Boolean |
| use-distributed-optimizer | Use the distributed optimizer | Disabled | Boolean |
| context-parallel-size | Dimension size for context parallelism | 1 | int |
| nccl-communicator-config-path | Path to a NCCL communicator config yaml that sets min_ctas, max_ctas, and cga_cluster_size | None | str |
| use-tp-pp-dp-mapping | Assign ranks in tp-pp-dp order instead of the default tp-dp-pp | Disabled | Boolean |
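
The parallelism sizes must divide the world size: on 64 GPUs, TP 8 × PP 2 leaves a data-parallel size of 64 / (8 × 2) = 4 (divided further by the context-parallel size, if used). A minimal sketch with illustrative values:

```bash
# Minimal sketch of a TP/PP layout with communication overlap.
DISTRIBUTED_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 2
    --use-distributed-optimizer
    --overlap-grad-reduce      # overlap DDP grad reduce with compute
    --overlap-param-gather     # overlap the parameter all-gather
)
```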

## validation_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| eval-iters | Number of iterations for evaluation/validation/testing | 100 | int |
| eval-interval | Iteration interval for evaluation | 1000 | int |
| test-mode | Perform real-time testing in parallel with training | Disabled | Boolean |
| skip-train | Skip training and run only evaluation | Disabled | Boolean |

## data_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| data-path | Path to the dataset; multiple paths can be specified | None | |
| split | Split ratio for the training, validation, and test sets | "969, 30, 1" | str |
| train-data-path | Path to the training dataset | None | |
| valid-data-path | Path to the validation dataset | None | |
| test-data-path | Path to the test dataset | None | |
| data-cache-path | Storage path for cached index files | None | |
| no-mmap-bin-files | Disable mmap-ing of .bin files | Disabled | Boolean |
| mock-data | Skip data loading, validation, and optimization, and generate artificial mock data instead | Disabled | Boolean |
| vocab-size | Vocabulary size | None | int |
| vocab-file | Path to the vocabulary file | None | str |
| merge-file | Path to the BPE merge file | None | str |
| vocab-extra-ids | Number of extra vocabulary tokens | 0 | int |
| seq-length | Maximum sequence length | None | int |
| encoder-seq-length | Maximum sequence length for the encoder | None | int |
| decoder-seq-length | Maximum sequence length for the decoder | None | int |
| retriever-seq-length | Maximum sequence length for the retriever in biencoder models | 256 | int |
| sample-rate | Sample rate for training data | 1.0 | float, (0, 1) |
| mask-prob | Probability of replacing a token with a mask | 0.15 | float |
| short-seq-prob | Probability of generating short sequences | 0.1 | float |
| num-workers | Number of dataloader workers | 2 | int |
| tokenizer-type | Tokenizer to use | None | "BertWordPieceLowerCase" or "BertWordPieceCase" or "GPT2BPETokenizer" or "SentencePieceTokenizer" or "GPTSentencePieceTokenizer" or "HuggingFaceTokenizer" or "Llama2Tokenizer" or "Llama3Tokenizer" or "MistralTokenizer" or "TikTokenizer" or "NullTokenizer" |
| tokenizer-model | SentencePiece tokenizer model | None | str |
| tiktoken-pattern | Tiktoken tokenizer version | None | "v1" or "v2" |
| tiktoken-num-special-tokens | Number of special tokens for the Tiktoken tokenizer | 1000 | int |
| tiktoken-special-tokens | Tiktoken special tokens, e.g. `["<unk>", "<s>", "</s>"]` | None | array of str |
| reset-position-ids | Reset position IDs after the end-of-document token | Disabled | Boolean |
| reset-attention-mask | Reset the attention mask after the end-of-document token | Disabled | Boolean |
| eod-mask-loss | Mask the loss for the end-of-document token | Disabled | Boolean |
| no-create-attention-mask-in-dataloader | Do not create attention masks in the dataloader | Disabled | Boolean |
| num-dataset-builder-threads | Number of threads per rank for the dataset builder | 1 | int |
| s3-cache-path | Path for cached index files when using the s3 dataloader | None | str |
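
A minimal sketch of a SentencePiece-based data setup; the dataset prefix and tokenizer path are placeholders:

```bash
# Minimal sketch of data and tokenizer flags.
DATA_ARGS=(
    --data-path /path/to/dataset_prefix
    --split 969,30,1
    --seq-length 4096
    --tokenizer-type SentencePieceTokenizer
    --tokenizer-model /path/to/tokenizer.model
    --num-workers 2
)
```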

## autoresume_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| adlr-autoresume | Enable autoresume on the adlr cluster | Disabled | Boolean |
| adlr-autoresume-interval | Interval at which to check for the autoresume signal | 1000 | int |

## moe_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| expert-model-parallel-size | Dimension of expert model parallelism | 1 | int |
| num-experts | Number of experts in MoE | None | int |
| moe-router-load-balancing-type | Load balancing type | "aux_loss" | "aux_loss" or "sinkhorn" or "none" |
| moe-router-topk | Number of experts to route each token to | 2 | int |
| moe-router-pre-softmax | Enable pre-softmax routing | Disabled | Boolean |
| moe-grouped-gemm | Utilize grouped GEMM | Disabled | Boolean |
| moe-aux-loss-coeff | Scaling coefficient for the aux loss; 1e-2 is recommended | 0.0 | float |
| moe-z-loss-coeff | Scaling coefficient for the z-loss; 1e-3 is recommended | None | float |
| moe-input-jitter-eps | Jitter epsilon for noise applied to the input tensors | None | float |
| moe-token-dispatcher-type | MoE token dispatcher type | "allgather" | "allgather" or "alltoall" |
| moe-per-layer-logging | Enable per-layer logging | Disabled | Boolean |
| moe-expert-capacity-factor | Capacity factor for each expert | None | float |
| moe-pad-expert-input-to-capacity | Pad expert input to capacity | Disabled | Boolean |
| moe-token-drop-policy | Token drop policy | "probs" | "probs" or "position" |
| moe-layer-recompute | Enable checkpointing for moe_layer | Disabled | Boolean |
| moe-extended-tp | Enable expert parallelism | Disabled | Boolean |
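
A minimal sketch of a top-2 MoE configuration with eight experts sharded across four expert-parallel ranks; the values are illustrative:

```bash
# Minimal sketch of MoE flags.
MOE_ARGS=(
    --num-experts 8
    --moe-router-topk 2
    --expert-model-parallel-size 4
    --moe-router-load-balancing-type aux_loss
    --moe-aux-loss-coeff 1e-2    # recommended value per the table above
    --moe-grouped-gemm
)
```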

## logging_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| log-params-norm | Calculate and log parameter norms | Disabled | Boolean |
| log-num-zeros-in-grad | Calculate and log the number of zeros in gradients | Disabled | Boolean |
| log-throughput | Calculate and log throughput | Disabled | Boolean |
| log-progress | Output progress to progress.txt in the checkpoint directory | Disabled | Boolean |
| timing-log-level | Timing log level | 0 | 0 (iteration info only) or 1 (some operations) or 2 (all operations) |
| no-barrier-with-level-1-timing | Do not use a barrier with the level 1 timer | Disabled | Boolean |
| timing-log-option | How to report times across multiple ranks | "minmax" | "max" (report max only) or "minmax" (report min and max) or "all" (report all) |
| tensorboard-log-interval | Logging interval for TensorBoard | 1 | int |
| tensorboard-queue-size | Event queue size for TensorBoard | 1000 | int |
| log-timers-to-tensorboard | Log timers to TensorBoard | Disabled | Boolean |
| no-log-loss-scale-to-tensorboard | Do not log the loss scale to TensorBoard | Disabled | Boolean |
| log-validation-ppl-to-tensorboard | Log validation perplexity to TensorBoard | Disabled | Boolean |
| log-memory-to-tensorboard | Log memory to TensorBoard | Disabled | Boolean |
| log-world-size-to-tensorboard | Log world size to TensorBoard | Disabled | Boolean |
| wandb-project | Wandb project name | "" | str |
| wandb-exp-name | Wandb experiment name | "" | str |
| wandb-save-dir | Directory to save Wandb results locally | "" | str |
| logging-level | Default logging level | None | str |
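
A minimal sketch of a TensorBoard-plus-throughput logging setup; the project name is a placeholder, and --tensorboard-dir itself is listed in an earlier table:

```bash
# Minimal sketch of logging flags.
LOGGING_ARGS=(
    --log-throughput
    --log-timers-to-tensorboard
    --tensorboard-log-interval 1
    --wandb-project my-project    # placeholder name
)
```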

## straggler_detector_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| log-straggler | Log stragglers per GPU | Disabled | Boolean |
| disable-straggler-on-startup | Disable the StragglerDetector on startup | Disabled | Boolean |
| straggler-ctrlr-port | Port for turning the StragglerDetector on/off | 65535 | int |
| straggler-minmax-count | Number of ranks for which to report high/low estimated throughput | 1 | int |

## inference_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| inference-batch-times-seqlen-threshold | During inference, do not use pipelining if batch size × sequence length falls below this value | 512 | int |
| max-tokens-to-oom | Maximum number of tokens during inference, to prevent OOM (out of memory) | 12000 | int |
| output-bert-embeddings | Output BERT embeddings | Disabled | Boolean |
| bert-embedder-type | BERT embedder | "megatron" | "megatron" or "huggingface" |

## transformer_engine_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| fp8-format | FP8 format | None | "e4m3" or "hybrid" |
| fp8-margin | Scaling margin for FP8 | 0 | int |
| fp8-interval | Scaling update interval for FP8 | 0 | int |
| fp8-amax-history-len | Number of steps of amax history recorded per tensor | 1 | int |
| fp8-amax-compute-algo | Algorithm for computing amax from history | "most_recent" | "most_recent" or "max" |
| no-fp8-wgrad | Compute wgrad in higher precision even when other computations are set to FP8 | Disabled | Boolean |
| transformer-impl | Transformer implementation | "local" | "local" or "transformer_engine" |
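
A minimal sketch of an FP8 setup via Transformer Engine; the history length is illustrative (the "hybrid" format commonly means e4m3 for forward tensors and e5m2 for gradients):

```bash
# Minimal sketch of Transformer Engine FP8 flags.
FP8_ARGS=(
    --transformer-impl transformer_engine
    --fp8-format hybrid
    --fp8-amax-history-len 1024
    --fp8-amax-compute-algo max
)
```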

## experimental_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| spec | Specify a `<module_location function_name>` pair | None | str |
| hybrid-attention-ratio | Ratio of attention layers to total layers | 0.0 | float, [0, 1] |
| hybrid-mlp-ratio | Ratio of MLP layers to total layers | 0.0 | float, [0, 1] |
| hybrid-override-pattern | Force a hybrid layer pattern | None | str |
| yaml-cfg | Config file for additional args | None | str |

## one_logger_args

| Option Name | Description | Default Value | Constraints |
|---|---|---|---|
| no-one-logger | Disable one_logger | Disabled | Boolean |
| one-logger-project | one_logger project name | None | str |
| one-logger-run-name | one_logger display name | None | str |
| one-logger-async | one_logger async mode | None | str |
| app-tag-run-name | one_logger tag name | None | str |
| app-tag-run-version | one_logger version | 0.0.0 | str |