Tuning for Distributed Learning Frameworks
This document presents tuning examples for frameworks widely used in distributed learning.
It uses tuning scripts from AIBooster Examples to tune the DeepSpeed and Megatron-LM training code included in MDK (Model Development Kit). As a prerequisite, clone the mdk and aibooster-examples repositories.
DeepSpeed
Using CommandOutputTuner, we rewrite the DeepSpeed configuration file according to each trial's hyperparameters, launch distributed training with the deepspeed command, and minimize the train_runtime reported in the standard output.
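The per-trial mechanics can be sketched independently of the tuner itself. In this sketch the helper names, the chosen DeepSpeed config keys, and the assumed `train_runtime = <seconds>` log line are all illustrative assumptions, not ZenithTune's actual API:

```python
import json
import re


def write_ds_config(params, path):
    """Write a DeepSpeed config file reflecting one trial's hyperparameters.

    The tuned keys shown here (micro-batch size, gradient accumulation,
    ZeRO stage) are illustrative choices, not a prescribed set.
    """
    config = {
        "train_micro_batch_size_per_gpu": params["micro_batch_size"],
        "gradient_accumulation_steps": params["grad_accum_steps"],
        "zero_optimization": {"stage": params["zero_stage"]},
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


def parse_train_runtime(stdout):
    """Extract train_runtime (seconds) from captured training output.

    Assumes a line like "train_runtime = 123.45"; adjust the pattern to
    whatever your trainer actually prints.
    """
    match = re.search(r"train_runtime\s*[=:]\s*([0-9.]+)", stdout)
    if match is None:
        raise ValueError("train_runtime not found in output")
    return float(match.group(1))
```

A tuner then repeats this pair per trial: write the config, run the deepspeed command, and minimize the parsed runtime.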
This example assumes that you have completed MDK's step-by-step.md.
# Move to MDK directory
cd mdk/
# Create Python virtual environment
python3 -m venv .venv
. .venv/bin/activate
# Install AIBooster, which provides ZenithTune
pip install aibooster
# Copy DeepSpeed tuning script to current directory
cp <aibooster-examples repository>/intelligence/zenith_tune/frameworks/deepspeed/* .
# Execute tuning with a single process
python tune_deepspeed.py
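Conceptually, a script like tune_deepspeed.py runs a loop of the following shape. This is a minimal stand-alone sketch using plain random search from the standard library, not ZenithTune's actual API, and the dummy command below stands in for a real deepspeed launch:

```python
import random
import re
import subprocess
import sys


def run_trial(command):
    """Run one training command and return train_runtime parsed from stdout.

    Assumes the output contains a line like "train_runtime = 12.3".
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return float(re.search(r"train_runtime\s*[=:]\s*([0-9.]+)",
                           result.stdout).group(1))


def tune(n_trials=3):
    """Random-search sketch: sample hyperparameters, run, keep the fastest."""
    best_runtime, best_batch = float("inf"), None
    for _ in range(n_trials):
        micro_batch = random.choice([1, 2, 4, 8])
        # In practice this step would rewrite the DeepSpeed config and launch
        # something like ["deepspeed", "train.py", "--deepspeed", "ds_config.json"].
        # A dummy command that just prints a runtime keeps the sketch runnable.
        command = [sys.executable, "-c",
                   f"print('train_runtime = {100 / micro_batch}')"]
        runtime = run_trial(command)
        if runtime < best_runtime:
            best_runtime, best_batch = runtime, micro_batch
    return best_runtime, best_batch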