List of Megatron-LM Acceleration Options
network_size_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
num-layers | Number of transformer layers | None | Model dependent, int |
encoder-num-layers | Number of transformer encoder layers | None | Model dependent, int |
decoder-num-layers | Number of transformer decoder layers | None | Model dependent, int |
hidden-size | Hidden size of transformer | None | Model dependent, int |
ffn-hidden-size | Hidden size of transformer FFN | 4 * hidden-size | Model dependent, int |
num-attention-heads | Number of transformer attention heads | None | Model dependent, int |
kv-channels | Projection weights dimension for multi-head attention | hidden-size // num-attention-heads | Model dependent, int |
group-query-attention | Use group-query-attention | Disabled | Boolean |
num-query-groups | Number of query groups for group-query-attention | 1 | int |
max-position-embeddings | Position embedding size | None | Model dependent, int |
position-embedding-type | Position embedding type | "learned_absolute" | "learned_absolute" or "rope" or "none"
use-rotary-position-embeddings | Use rotary position embeddings. Deprecated: use position-embedding-type "rope" instead. | Disabled | Boolean
rotary-base | Theta value for rotary position embeddings | 10000 | int |
rotary-percent | Usage percentage of rotary dimension | 1.0 (100%) | float |
rotary-interleaved | Use interleaved rotary embedding | Disabled | Boolean |
rotary-seq-len-interpolation-factor | Sequence length interpolation factor for rotary embeddings | None | int |
no-position-embedding | Do not use position embedding | Disabled | Boolean |
make-vocab-size-divisible-by | Pad the vocab size so that it is divisible by this value, for computational efficiency | 128 | int
normalization | Layer normalization type | "layernorm" | "layernorm" or "rmsnorm" |
norm-epsilon | Layer norm epsilon | 1e-5 | float |
apply-layernorm-1p | Center layer norm weights around 0 to improve numerical stability | Disabled | Boolean
apply-residual-connection-post-layernorm | Use the original BERT residual connection order | Disabled | Boolean |
openai-gelu | Use OpenAI GeLU (deprecated except for ensuring backward compatibility) | Disabled | Boolean |
squared-relu | Use squared relu activation instead of the default gelu | Disabled | Boolean |
swiglu | Use gated linear units and SiLU activation instead of the default gelu | Disabled | Boolean |
onnx-safe | Option to avoid known issues with the ONNX exporter | false | Boolean
bert-no-binary-head | Disable BERT binary head | Disabled | Boolean |
untie-embeddings-and-output-weights | Untie embeddings and output weights | Disabled | Boolean |
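
Several of these defaults are derived from other options: ffn-hidden-size defaults to 4 × hidden-size, kv-channels to hidden-size // num-attention-heads, and the vocab size is padded up to a multiple of make-vocab-size-divisible-by (typically also multiplied by the tensor-parallel size so the embedding shards evenly). A minimal Python sketch of that arithmetic with hypothetical model numbers; the padding helper is illustrative, not Megatron-LM's internal function:

```python
# Hypothetical model configuration; only the arithmetic mirrors the defaults above.
hidden_size = 4096
num_attention_heads = 32

ffn_hidden_size = 4 * hidden_size                 # default for ffn-hidden-size
kv_channels = hidden_size // num_attention_heads  # default for kv-channels

def padded_vocab_size(orig_vocab_size: int, divisor: int = 128, tp_size: int = 1) -> int:
    """Pad the vocab so it is divisible by make-vocab-size-divisible-by
    (times the tensor-parallel size). Illustrative helper, not Megatron-LM's code."""
    multiple = divisor * tp_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple

print(ffn_hidden_size)                       # 16384
print(kv_channels)                           # 128
print(padded_vocab_size(50257, tp_size=8))   # 51200
```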
regularization_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
attention-dropout | Dropout rate after attention | 0.1 | float |
hidden-dropout | Dropout rate for hidden state transformer | 0.1 | float |
weight-decay | Weight decay coefficient for L2 regularization | 0.01 | float |
start-weight-decay | Initial weight decay coefficient for L2 regularization | None | float |
end-weight-decay | Final weight decay coefficient for L2 regularization | None | float |
weight-decay-incr-style | Weight decay increment function | "constant" | "constant" or "linear" or "cosine" |
clip-grad | Gradient clipping based on global L2 norm | 1.0 | float |
adam-beta1 | Adam coefficient 1 | 0.9 | float |
adam-beta2 | Adam coefficient 2 | 0.999 | float |
adam-eps | Term added to the denominator to improve numerical stability | 1e-8 | float |
sgd-momentum | Momentum factor for SGD | 0.9 | float |
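
The Adam options above map onto a standard AdamW-style update (decoupled weight decay), with clip-grad applied to the global gradient norm beforehand. A compact reference sketch, not Megatron-LM's fused or distributed optimizer:

```python
import math

def adamw_step(p, g, m, v, t, *, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One scalar AdamW-style step using adam-beta1/2, adam-eps, and weight-decay.
    Illustrative only; clip-grad would rescale g by the global norm before this."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = 0.5, 0.0, 0.0
for t in range(1, 4):                     # three steps with a constant toy gradient
    p, m, v = adamw_step(p, 0.1, m, v, t)
print(p)
```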
training_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
micro-batch-size | Micro batch size | None | int |
global-batch-size | Global batch size | None (micro-batch-size * data-parallel-size) | int
rampup-batch-size | Linearly increase the batch size from a starting value up to global-batch-size. Format: --rampup-batch-size <start batch size> <batch size increment> <ramp-up samples> | None | three ints
decrease-batch-size-if-needed | Decrease batch_size if it is not divisible by micro_batch_size × dp_size | disabled | Boolean |
recompute-activations | Enable activation recomputation for training large models, sequences, and batch sizes | disabled | Boolean |
recompute-granularity | Granularity of recompute-activations. "full": entire transformer layer, "selective": core attention part. | None | "full" or "selective" |
no-check-for-nan-in-loss-and-grad | Do not check for NaN in loss and gradients | disabled | Boolean |
distribute-saved-activations | Distribute recomputed activations across model parallel groups | disabled | Boolean |
recompute-method | Recomputation method. "uniform": Divide transformer layers uniformly and recompute in each chunk. "block": Recompute per pipeline stage. | None | "uniform" or "block" |
recompute-num-layers | Number of layers to recompute using the method specified by recompute-method | 1 | int |
no-clone-scatter-output-in-embedding | Do not clone the scatter output in the embedding layer (cloning allows the original tensor to be garbage collected) | disabled | Boolean
profile | Enable nsys profiling | disabled | Boolean |
profile-step-start | Starting global step to profile | 10 | int |
profile-step-end | Ending global step to profile | 12 | int |
profile-ranks | Global ranks to profile | [0] | array of int |
tp-comm-overlap | Overlap TP communication with GEMM kernels | disabled | Boolean |
tp-comm-overlap-cfg | Config file for tp-comm-overlap | None | str |
disable-tp-comm-overlap-ag | Disable overlap of GEMM and All-Gather in pipeline | disabled | Boolean |
disable-tp-comm-overlap-rs | Disable overlap of GEMM and Reduce-Scatter in pipeline | disabled | Boolean |
tp-comm-overlap-rs-dgrad | Overlap Reduce-Scatter with dgrad GEMM | disabled | Boolean |
disable-tp-comm-bulk-dgrad | Disable overlap of All-Gather and bprop activation gradient GEMM | disabled | Boolean |
disable-tp-comm-bulk-wgrad | Disable overlap of All-Gather and bprop weight gradient GEMM | disabled | Boolean |
use-cpu-initialization | Initialize weights on CPU to eliminate differences caused by TP initialization | disabled | Boolean |
empty-unused-memory-level | Call torch.cuda.empty_cache() per iteration | 0 | 0 (off) or 1 (moderate) or 2 (aggressive)
deterministic-mode | Run with deterministic behavior for debugging | disabled | Boolean |
check-weight-hash-across-dp-replicas-interval | Interval to check weight hash across DP replicas | None | int |
calculate-per-token-loss | Calculate cross entropy loss for non-padded tokens in the global batch | disabled | Boolean |
checkpoint-activations | Deprecated; equivalent to recompute-activations | disabled | Boolean
train-iters | Number of training iterations | None | int (mutually exclusive with train-samples) |
train-samples | Number of training samples | None | int (mutually exclusive with train-iters) |
log-interval | Iteration interval for logging output | 100 | int |
exit-interval | Exit iteration | None | int |
exit-duration-in-mins | Exit duration (minutes) | None | int |
exit-signal-handler | Save checkpoint and exit after receiving SIGTERM | disabled | Boolean |
tensorboard-dir | Directory for tensorboard logs | None | str |
no-masked-softmax-fusion | Disable fusion of query_key_value scaling, masking, and softmax | disabled | Boolean |
no-bias-gelu-fusion | Disable fusion of bias and gelu | disabled | Boolean |
no-bias-swiglu-fusion | Disable fusion of bias and swiglu | disabled | Boolean |
no-bias-dropout-fusion | Disable fusion of bias and dropout | disabled | Boolean |
no-rope-fusion | Disable rope fusion. Only supported by megatron-core. | disabled | Boolean
cross-entropy-loss-fusion | Fuse cross entropy loss calculation | disabled | Boolean |
use-flash-attn | Use FlashAttention | disabled | Boolean |
disable-bias-linear | Disable bias in linear layers | disabled | Boolean |
add-qkv-bias | Enable bias in QKV linear layer | disabled | Boolean |
optimizer | Optimizer | "adam" | "adam" or "sgd"
dataloader-type | Dataloader type (single-pass or multi-pass) | None | "single" or "cyclic" or "external"
no-async-tensor-model-parallel-allreduce | This option is ignored | disabled | Boolean |
no-persist-layer-norm | Disable persistent fused layer norm kernel. Supported only for specific hidden sizes. | disabled | Boolean |
sequence-parallel | Enable sequence parallel optimization in Megatron-LM | disabled | Boolean |
no-gradient-accumulation-fusion | Disable fusing gradient accumulation for weight gradient calculation in linear layers | disabled | Boolean |
use-mcore-models | Use megatron-core models. Deprecated, since mcore models are now the default. | disabled | Boolean
use-legacy-models | Use megatron-legacy | disabled | Boolean |
manual-gc | Disable the threshold-based garbage collector and trigger garbage collection manually | disabled | Boolean
manual-gc-interval | Training step interval to trigger garbage collection | 0 | int (0 does not trigger gc) |
no-manual-gc-eval | Do not perform manual gc during evaluation | disabled | Boolean |
disable-tp-comm-split-ag | Do not overlap All-Gather and fprop GEMM | disabled | Boolean |
disable-tp-comm-split-rs | Do not overlap Reduce-Scatter and fprop GEMM | disabled | Boolean |
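
micro-batch-size, global-batch-size, and the data-parallel size together determine how many gradient-accumulation micro-batches each rank runs per iteration. A quick sketch of that bookkeeping with hypothetical sizes (data_parallel_size would normally be derived from the world size and the model-parallel sizes, see distributed_args below):

```python
# Illustrative numbers only.
micro_batch_size = 2       # micro-batch-size
global_batch_size = 512    # global-batch-size
data_parallel_size = 8

samples_per_step = micro_batch_size * data_parallel_size
assert global_batch_size % samples_per_step == 0, \
    "global-batch-size must be divisible by micro-batch-size * data-parallel-size"

# Each rank runs this many forward/backward micro-batches per iteration,
# accumulating gradients before the optimizer step.
grad_accum_steps = global_batch_size // samples_per_step
print(grad_accum_steps)  # 32
```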
initialization_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
seed | Random seed used in python, numpy, pytorch, and cuda | 1234 | int |
data-parallel-random-init | Randomly initialize parameters along data parallel ranks | disabled | Boolean |
init-method-std | Standard deviation of zero-mean normal distribution used for weight initialization | 0.02 | float |
init-method-xavier-uniform | Enable Xavier uniform parameter initialization | disabled | Boolean |
learning_rate_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
lr | Initial learning rate before applying warmup and decay | None | float |
lr-decay-style | Learning rate decay function | "linear" | "constant" or "linear" or "cosine" or "inverse-square-root" or "WSD" |
lr-wsd-decay-style | Decay style for WSD | "exponential" | "exponential" or "linear" or "cosine" |
lr-decay-iters | Number of iterations over which to decay learning rate | None (= train-iters) | int |
lr-decay-samples | Number of samples over which to decay learning rate | None (= train-samples) | int |
lr-wsd-decay-samples | Number of decay samples for WSD | None | int |
lr-wsd-decay-iters | Number of decay iterations for WSD | None | int |
lr-warmup-fraction | Warmup length as a fraction of lr-decay-(iters/samples) | None | float
lr-warmup-iters | Number of iterations over which to linearly warmup learning rate | 0 | int |
lr-warmup-samples | Number of samples over which to linearly warmup learning rate | 0 | int |
lr-warmup-init | Initial value for lr warmup | 0.0 | float |
warmup | Old lr warmup parameter, use lr-warmup-* | None | int |
min-lr | Minimum learning rate | 0.0 | float |
override-opt_param-scheduler | Reset all lr scheduler values, ignoring the checkpoint | disabled | Boolean |
use-checkpoint-opt_param-scheduler | Use lr scheduler values from the checkpoint | disabled | Boolean |
decoupled-lr | Separate lr for input and output layers | None | float |
decoupled-min-lr | Separate minimum lr for input and output layers | None | float |
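
For --lr-decay-style cosine with a linear warmup, these options combine roughly as follows. Illustrative formula only; the exact Megatron-LM scheduler may differ in details such as lr-warmup-init handling:

```python
import math

def lr_at_step(step, *, lr=3e-4, min_lr=3e-5, lr_warmup_iters=2000, lr_decay_iters=100_000):
    """Linear warmup (from an assumed lr-warmup-init of 0.0) followed by cosine
    decay to min-lr. Sketch of --lr, --min-lr, --lr-warmup-iters, --lr-decay-iters."""
    if step < lr_warmup_iters:
        return lr * step / lr_warmup_iters
    if step >= lr_decay_iters:
        return min_lr
    progress = (step - lr_warmup_iters) / (lr_decay_iters - lr_warmup_iters)
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(1000))     # mid-warmup: 1.5e-4
print(lr_at_step(2000))     # warmup finished: 3e-4
print(lr_at_step(100_000))  # fully decayed: 3e-5
```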
checkpointing_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
save | Path to save checkpoints | None | str |
save-interval | Iteration interval for saving checkpoints | None | int |
no-save-optim | Do not save optimizer state | disabled | Boolean |
no-save-rng | Do not save rng state | disabled | Boolean |
load | Path to load checkpoints | None | str |
no-load-optim | Do not load optimizer state from checkpoint | disabled | Boolean |
no-load-rng | Do not load rng state from checkpoint | disabled | Boolean |
non-persistent-save-interval | Iteration interval for non-persistent saves | None | int |
non-persistent-ckpt-type | Type of non-persistent model checkpoint | None (do not use non-persistent checkpointing) | "global" (Lustre) or "local" (SSD/ramdisk per rank, TBD) or "in_memory" (special method to prevent serialization, TBD)
non-persistent-global-ckpt-dir | Directory for global non-persistent model checkpoints | None | str |
finetune | Set iteration to 0 and do not load optimizer or rng state from checkpoint for finetuning | disabled | Boolean |
pretrained-checkpoint | Checkpoint directory for a pretrained model for finetuning | None | str |
ckpt-step | Step of the checkpoint to load | None | int |
no-initialization | Skip initialization during model construction to reduce startup time when loading from a checkpoint | disabled | Boolean |
use-checkpoint-args | Overwrite current args with args from the checkpoint | disabled | Boolean |
exit-on-missing-checkpoint | Exit instead of training with random parameters if checkpoint loading fails | disabled | Boolean |
use-dist-ckpt | Use distributed checkpoint format | disabled | Boolean |
auto-detect-ckpt-format | Automatically detect checkpoint format | disabled | Boolean |
dist-ckpt-format | Distributed checkpoint format | "torch_dist" | "zarr" or "torch_dist" |
ckpt-fully-parallel-save | Deprecated as it is now the default | disabled | Boolean |
no-ckpt-fully-parallel-save | Do not perform ckpt-fully-parallel-save | disabled | Boolean |
async-save | Save checkpoints asynchronously | disabled | Boolean |
ckpt-fully-parallel-load | Load checkpoints saved with ckpt-fully-parallel-save | disabled | Boolean |
ckpt-assume-constant-structure | Assume that model and optimizer structure are constant | disabled | Boolean |
dist-ckpt-strictness | Method for handling key mismatches during distributed checkpoint loading | "assume_ok_unexpected" | StrictHandling.values |
mixed_precision_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
fp16 | Run the model in fp16 | disabled | Boolean |
bf16 | Run the model in bfloat16 | disabled | Boolean |
loss-scale | Static loss scaling. Setting a power of 2 can be expected to improve fp16 convergence. | None (dynamic loss scaling) | float |
initial-loss-scale | Initial value for dynamic loss scaling | 2**32 | float |
min-loss-scale | Minimum value for dynamic loss scaling | 1.0 | float |
loss-scale-window | Up/down window for dynamic scale | 1000 | float |
hysteresis | Hysteresis for dynamic loss scaling | 2 | int |
fp32-residual-connection | Move residual connections to FP32 | disabled | Boolean |
apply-query-key-layer-scaling | Scale Q * K^T by 1 / layer-number | disabled | Boolean |
attention-softmax-in-fp32 | Calculate attention masking and softmax in fp32 when --no-query-key-layer-scaling is set | disabled | Boolean |
accumulate-allreduce-grads-in-fp32 | Perform gradient accumulation and allreduce in FP32 | disabled | Boolean |
fp16-lm-cross-entropy | Move unreduced cross entropy loss calculation for lm head to fp16 | disabled | Boolean |
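
initial-loss-scale, min-loss-scale, loss-scale-window, and hysteresis govern dynamic loss scaling: the scale is halved after repeated overflows and doubled after a window of clean steps. A rough sketch of that behaviour, not the exact Megatron/Apex implementation:

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaler built from the options above."""

    def __init__(self, initial_loss_scale=2**32, min_loss_scale=1.0,
                 loss_scale_window=1000, hysteresis=2):
        self.scale = initial_loss_scale
        self.min_scale = min_loss_scale
        self.window = loss_scale_window
        self.hysteresis = hysteresis
        self._hysteresis_left = hysteresis
        self._good_steps = 0

    def update(self, found_overflow: bool):
        if found_overflow:
            # Tolerate `hysteresis` overflows before halving the scale.
            self._hysteresis_left -= 1
            self._good_steps = 0
            if self._hysteresis_left <= 0:
                self.scale = max(self.scale / 2.0, self.min_scale)
                self._hysteresis_left = self.hysteresis
        else:
            self._good_steps += 1
            if self._good_steps % self.window == 0:
                # After `loss_scale_window` clean steps, double the scale.
                self.scale *= 2.0

scaler = DynamicLossScaler()
scaler.update(found_overflow=True)
scaler.update(found_overflow=True)
print(scaler.scale)  # halved once after the hysteresis budget is exhausted
```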
distributed_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
tensor-model-parallel-size | Tensor Parallelism size (TP size) | 1 | int |
pipeline-model-parallel-size | Pipeline Parallelism size (PP size) | 1 | int |
encoder-pipeline-model-parallel-size | PP size for encoder | None | int |
pipeline-model-parallel-split-rank | Rank to split between encoder and decoder. Deprecated: use encoder-pipeline-model-parallel-size. | None | int |
model-parallel-size | Old option. Use tensor-model-parallel-size | None | int |
num-layers-per-virtual-pipeline-stage | Number of layers per virtual pipeline stage | None | int |
no-overlap-p2p-communication | Do not overlap pipeline parallel communication with forward/backward chunks | disabled | Boolean |
distributed-backend | Backend for distributed training | "nccl" | "nccl" or "gloo" |
distributed-timeout-minutes | Timeout in minutes for torch.distributed. | 10 | int |
overlap-grad-reduce | Overlap DDP grad reduce | disabled | Boolean |
defer-embedding-wgrad-compute | Defer vocabulary projection linear layer weight gradient computation until pipeline flush | disabled | Boolean |
wgrad-deferral-limit | Number of microbatches to defer in defer-embedding-wgrad-compute | 0 | int |
no-delay-grad-reduce | Do not delay/synchronize grad reduction in all PP stages except the first | disabled | Boolean |
ddp-bucket-size | Bucket size for data-parallel communication | None | int |
ddp-average-in-collective | Calculate average in collective communication | disabled | Boolean |
overlap-param-gather | Overlap parameter all-gather | disabled | Boolean |
delay-param-gather | Delay/synchronize parameter all-gather in all PP stages except the first | disabled | Boolean |
no-scatter-gather-tensors-in-pipeline | Do not use scatter/gather to optimize tensor communication in the pipeline | disabled | Boolean |
use-ring-exchange-p2p | Use custom-built ring exchange for p2p communications | disabled | Boolean |
local-rank | Local rank passed from the distributed launcher | None | int
lazy-mpu-init | Skip DDP initialization during initialize_megatron() and return an alternative function. For external DDP managers. | disabled | Boolean |
standalone-embedding-stage | Place the input embedding layer in its own pipeline stage | disabled | Boolean
use-distributed-optimizer | Use distributed optimizer | disabled | Boolean |
context-parallel-size | Dimension size for context parallelism | 1 | int |
nccl-communicator-config-path | Path to NCCL communicator config yaml. Sets min_ctas, max_ctas, and cga_cluster_size. | None | str |
use-tp-pp-dp-mapping | Assign ranks using tp-pp-dp order instead of the default tp-dp-pp | disabled | Boolean
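
The data-parallel size is not set directly; it is what remains of the world size after the tensor-, pipeline-, and context-parallel sizes are factored out. A sketch with a hypothetical 128-GPU job:

```python
# Hypothetical job size; the decomposition mirrors the options above.
world_size = 128
tensor_model_parallel_size = 8     # tensor-model-parallel-size
pipeline_model_parallel_size = 4   # pipeline-model-parallel-size
context_parallel_size = 1          # context-parallel-size

model_parallel_size = (tensor_model_parallel_size
                       * pipeline_model_parallel_size
                       * context_parallel_size)
assert world_size % model_parallel_size == 0, \
    "world size must be divisible by the product of the model-parallel sizes"

data_parallel_size = world_size // model_parallel_size
print(data_parallel_size)  # 4
```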
validation_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
eval-iters | Number of iterations for evaluation/validation/testing | 100 | int |
eval-interval | Iteration interval for evaluation | 1000 | int |
test-mode | Perform real-time testing in parallel with training | disabled | Boolean |
skip-train | Skip training and run only evaluation | disabled | Boolean |
data_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
data-path | Path to the dataset, multiple paths can be specified | None | |
split | Split ratio for training, validation, and test sets | "969, 30, 1" | str |
train-data-path | Path to the training dataset | None | |
valid-data-path | Path to the validation dataset | None | |
test-data-path | Path to the test dataset | None | |
data-cache-path | Storage path for cache index files | None | |
no-mmap-bin-files | Disable mmap-ing of .bin files | disabled | Boolean |
mock-data | Skip data loading and validation, and generate artificial mock data instead | disabled | Boolean
vocab-size | Vocabulary size | None | int |
vocab-file | Path to the vocabulary file | None | str
merge-file | Path to the BPE merge file | None | str
vocab-extra-ids | Number of extra vocabulary tokens | 0 | int |
seq-length | Maximum sequence length | None | int |
encoder-seq-length | Maximum sequence length for encoder | None | int |
decoder-seq-length | Maximum sequence length for decoder | None | int |
retriever-seq-length | Maximum sequence length for retriever in biencoder models | 256 | int |
sample-rate | Sample rate for training data | 1.0 | float, (0, 1]
mask-prob | Probability of replacing a token with a mask | 0.15 | float |
short-seq-prob | Probability of generating short sequences | 0.1 | float |
num-workers | Dataloader number of workers | 2 | int |
tokenizer-type | Tokenizer to use | None | "BertWordPieceLowerCase" or "BertWordPieceCase" or "GPT2BPETokenizer" or "SentencePieceTokenizer" or "GPTSentencePieceTokenizer" or "HuggingFaceTokenizer" or "Llama2Tokenizer" or "Llama3Tokenizer" or "MistralTokenizer" or "TikTokenizer" or "NullTokenizer" |
tokenizer-model | Sentencepiece tokenizer model | None | str |
tiktoken-pattern | Tiktoken tokenizer version | None | "v1" or "v2" |
tiktoken-num-special-tokens | Number of special tokens for Tiktoken tokenizer | 1000 | int |
tiktoken-special-tokens | Tiktoken special tokens example: ["<unk>", "<s>", "</s>"] | None | array of str |
reset-position-ids | Reset position IDs after the end-of-document token | disabled | Boolean |
reset-attention-mask | Reset attention mask after the end-of-document token | disabled | Boolean |
eod-mask-loss | Mask loss for end-of-document token | disabled | Boolean |
no-create-attention-mask-in-dataloader | Do not create attention masks in the dataloader | disabled | Boolean |
num-dataset-builder-threads | Number of threads per rank for the dataset builder | 1 | int |
s3-cache-path | Path for cache index files when using s3 dataloader | None | str |
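
For example, the default split "969, 30, 1" divides samples proportionally across train/validation/test. Roughly, as illustrative arithmetic only (the real index builder handles rounding and document boundaries differently):

```python
def split_counts(total_samples: int, split: str = "969, 30, 1"):
    """Divide a dataset into train/validation/test according to --split.
    Illustrative sketch, not Megatron-LM's index-building code."""
    weights = [float(w) for w in split.split(",")]
    total_weight = sum(weights)
    return [int(total_samples * w / total_weight) for w in weights]

print(split_counts(1_000_000))  # [969000, 30000, 1000]
```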
autoresume_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
adlr-autoresume | Enable autoresume on adlr cluster | disabled | Boolean |
adlr-autoresume-interval | Interval to check for autoresume signal | 1000 | int |
moe_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
expert-model-parallel-size | Dimension of expert model parallelism | 1 | int |
num-experts | Number of experts in MoE | None | int |
moe-router-load-balancing-type | Load balancing type | "aux_loss" | "aux_loss" or "sinkhorn" or "none" |
moe-router-topk | Number of experts to route each token to | 2 | int |
moe-router-pre-softmax | Enable pre-softmax routing | disabled | Boolean |
moe-grouped-gemm | Utilize Grouped GEMM | disabled | Boolean |
moe-aux-loss-coeff | Scaling coefficient for aux loss. Recommended 1e-2. | 0.0 | float |
moe-z-loss-coeff | Scaling coefficient for z-loss. Recommended 1e-3. | None | float |
moe-input-jitter-eps | Jitter epsilon for noise applied to input tensors | None | float |
moe-token-dispatcher-type | MoE token dispatcher type | "allgather" | "allgather" or "alltoall" |
moe-per-layer-logging | Enable per-layer logging | disabled | Boolean |
moe-expert-capacity-factor | Capacity factor for each expert | None | float |
moe-pad-expert-input-to-capacity | Pad expert input to capacity | disabled | Boolean |
moe-token-drop-policy | Drop tokens policy | "probs" | "probs" or "position"
moe-layer-recompute | Enable checkpointing for moe_layer | disabled | Boolean |
moe-extended-tp | Shard each expert across the combined tensor- and expert-parallel domain, as an alternative to plain expert parallelism | disabled | Boolean
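
moe-router-topk and moe-expert-capacity-factor together bound how many tokens each expert accepts before moe-token-drop-policy applies. A rough capacity calculation with hypothetical numbers; the actual Megatron-LM drop/pad behaviour has more detail:

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    topk: int = 2, capacity_factor: float = 1.25) -> int:
    """Tokens each expert accepts before dropping/padding kicks in.
    Illustrative formula, not Megatron-LM's exact implementation."""
    return math.ceil(tokens_per_batch * topk * capacity_factor / num_experts)

# 8192 tokens routed to 2 of 8 experts with a 1.25 capacity factor:
print(expert_capacity(8192, 8))  # 2560
```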
logging_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
log-params-norm | Calculate and log parameter norms | disabled | Boolean |
log-num-zeros-in-grad | Calculate and log number of zeros in gradients | disabled | Boolean |
log-throughput | Calculate and log throughput | disabled | Boolean |
log-progress | Output progress to progress.txt in the checkpoint directory | disabled | Boolean |
timing-log-level | Timing log level | 0 | 0 (Iteration info only) or 1 (Some operations) or 2 (All operations) |
no-barrier-with-level-1-timing | Do not use barrier with Level 1 timer | disabled | Boolean |
timing-log-option | Method for reporting time across multiple ranks | "minmax" | "max" (Report max only) or "minmax" (Report min and max) or "all" (Report all) |
tensorboard-log-interval | Logging interval for TensorBoard | 1 | int |
tensorboard-queue-size | Event queue size for TensorBoard | 1000 | int |
log-timers-to-tensorboard | Log timers to TensorBoard | disabled | Boolean |
no-log-loss-scale-to-tensorboard | Do not log loss scale to TensorBoard | disabled | Boolean |
log-validation-ppl-to-tensorboard | Log validation perplexity to TensorBoard | disabled | Boolean |
log-memory-to-tensorboard | Log memory to TensorBoard | disabled | Boolean |
log-world-size-to-tensorboard | Log world-size to TensorBoard | disabled | Boolean |
wandb-project | Wandb project name | "" | str |
wandb-exp-name | Wandb experiment name | "" | str |
wandb-save-dir | Directory to save Wandb results locally | "" | str |
logging-level | Default logging level | None | str |
straggler_detector_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
log-straggler | Log stragglers per GPU | disabled | Boolean |
disable-straggler-on-startup | Disable StragglerDetector on startup | disabled | Boolean |
straggler-ctrlr-port | Port to turn StragglerDetector on/off | 65535 | int |
straggler-minmax-count | Number of ranks to report high/low estimated throughput | 1 | int |
inference_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
inference-batch-times-seqlen-threshold | During inference, if batch size × sequence length falls below this value, do not use pipelining | 512 | int |
max-tokens-to-oom | Maximum number of tokens during inference to prevent OOM (Out of Memory) | 12000 | int |
output-bert-embeddings | Output Bert embeddings | disabled | Boolean |
bert-embedder-type | Bert embedder | "megatron" | "megatron" or "huggingface" |
transformer_engine_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
fp8-format | FP8 format | None | "e4m3" or "hybrid"
fp8-margin | Scaling margin for FP8 | 0 | int |
fp8-interval | Scaling update interval for FP8 | 0 | int |
fp8-amax-history-len | Number of steps to record amax history per tensor | 1 | int |
fp8-amax-compute-algo | Algorithm to compute amax from history | "most_recent" | "most_recent" or "max" |
no-fp8-wgrad | Calculate wgrad in high precision even if other calculations are set to FP8 | disabled | Boolean |
transformer-impl | Transformer implementation | "local" | "local" or "transformer_engine" |
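
fp8-amax-history-len, fp8-amax-compute-algo, and fp8-margin feed Transformer Engine's delayed-scaling recipe: pick an amax from the recorded history and derive a scaling factor from it. The gist, with hypothetical values (not Transformer Engine's exact code):

```python
from collections import deque

def fp8_scale(amax_history, compute_algo="most_recent", margin=0, fp8_max=448.0):
    """Delayed-scaling gist: select an amax, then scale = fp8_max / amax / 2**margin.
    fp8_max=448 corresponds to the e4m3 format. Illustrative only."""
    amax = amax_history[-1] if compute_algo == "most_recent" else max(amax_history)
    return fp8_max / amax / (2 ** margin)

history = deque([3.1, 2.7, 4.2], maxlen=16)    # fp8-amax-history-len 16
print(fp8_scale(history, compute_algo="max"))  # scale based on the largest amax seen
```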
experimental_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
spec | Specify a <module_location function_name> pair | None | str |
hybrid-attention-ratio | Ratio of attention layers to total layers | 0.0 | float, [0,1] |
hybrid-mlp-ratio | Ratio of mlp layers to total layers | 0.0 | float, [0,1] |
hybrid-override-pattern | Force a hybrid layer pattern | None | str |
yaml-cfg | Config file for additional args | None | str |
one_logger_args
Option Name | Description | Default Value | Constraints |
---|---|---|---|
no-one-logger | Disable one_logger | disabled | Boolean |
one-logger-project | One_logger project name | None | str |
one-logger-run-name | One-logger display name | None | str |
one-logger-async | One-logger async mode | None | str |
app-tag-run-name | One-logger tag name | None | str |
app-tag-run-version | One-logger version | 0.0.0 | str |