Tuning for Distributed Learning Frameworks
This document presents tuning examples for frameworks widely used in distributed learning.
The DeepSpeed and Megatron-LM examples use MDK (Model Development Kit) as sample code, so clone it at the same level as faib, as shown below:
$ ls
faib mdk
DeepSpeed
Using CommandOutputTuner, we rewrite the DeepSpeed configuration file according to the hyperparameters, run distributed training with the deepspeed command, and minimize the train_runtime reported in standard output. A conceptual sketch of this pattern follows the commands below.
This example assumes that you have completed MDK's step-by-step.md.
# Move to MDK directory
cd mdk/
# Create Python virtual environment
python3 -m venv .venv
. .venv/bin/activate
# Install ZenithTune
pip install ../faib/intelligence/zenith-tune
# Copy DeepSpeed tuning script to current directory
cp ../faib/intelligence/zenith-tune/examples/deepspeed/* .
# Execute tuning with a single process
python tune_deepspeed.py
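For intuition, here is a minimal, self-contained sketch of the pattern described above, written with plain subprocess and re rather than ZenithTune's actual API. The script name train.py, the config keys shown, and the log pattern are illustrative assumptions, not taken from the example script.

import json
import re
import subprocess

def run_trial(zero_stage: int, micro_batch_size: int) -> float:
    """Write a DeepSpeed config, run one training trial, return train_runtime (s)."""
    # Standard DeepSpeed config keys; extend with whatever you want to tune.
    config = {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "zero_optimization": {"stage": zero_stage},
    }
    with open("ds_config.json", "w") as f:
        json.dump(config, f)
    # "train.py" is a placeholder for the actual MDK training entry point.
    result = subprocess.run(
        ["deepspeed", "train.py", "--deepspeed", "ds_config.json"],
        capture_output=True, text=True, check=True,
    )
    # Adjust the pattern to the actual log format of your trainer.
    match = re.search(r"train_runtime[^0-9]*([0-9]+\.?[0-9]*)", result.stdout)
    return float(match.group(1))

# Evaluate a small grid and keep the fastest setting.
best = min(
    ((s, m, run_trial(s, m)) for s in (1, 2) for m in (1, 2, 4)),
    key=lambda t: t[2],
)
print("best (stage, micro_batch_size, runtime):", best)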
Megatron-LM
Using CommandOutputTuner, we rewrite values in the Megatron-LM training script and extract the throughput over 100 iterations from standard output. We optimize the harmonic mean of the per-iteration throughput, excluding the first 10 warmup iterations. In this example, we generate single-process command strings and run the tuning script itself in parallel with mpirun.
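As a concrete illustration of the objective, the steady-state harmonic mean can be computed as follows. The warmup count of 10 follows the description above; the function name and example values are ours.

from statistics import harmonic_mean

def steady_state_throughput(per_iter_throughput, warmup=10):
    """Harmonic mean of per-iteration throughput, skipping warmup iterations.

    With a fixed number of samples per iteration, the harmonic mean of
    per-iteration throughput equals total samples divided by total time.
    """
    return harmonic_mean(per_iter_throughput[warmup:])

# Example: 100 measured iterations, the first 10 of which are slow warmup.
measurements = [50.0] * 10 + [120.0] * 90
print(steady_state_throughput(measurements))  # 120.0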
This assumes that you have completed MDK's pretrain-megatron-lm.md.
# Move to MDK's Megatron-LM working directory
cd mdk/outputs/Megatron-LM/
# Install ZenithTune inside Singularity container
singularity exec --nv --bind ../../../faib/intelligence/zenith-tune:/zenith-tune ./pytorch.sif pip install /zenith-tune
# Copy tuning script to current directory
cp ../../../faib/intelligence/zenith-tune/examples/megatronlm/* .
# Execute tuning with multiple processes using mpirun
mpirun --np 8 singularity exec --nv ./pytorch.sif python tune_megatronlm.py
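Conceptually, launching the tuner under mpirun means each MPI rank can evaluate a different candidate in parallel. A minimal illustration of that pattern with mpi4py follows; this is not ZenithTune's actual mechanism, and evaluate is a placeholder.

from mpi4py import MPI

def evaluate(params):
    """Placeholder: run one single-process trial and return its throughput."""
    return float(params["micro_batch_size"])  # dummy value for illustration

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical candidate grid; each rank takes a different setting.
candidates = [{"micro_batch_size": m, "tp": t} for m in (1, 2, 4, 8) for t in (1, 2)]
mine = candidates[rank % len(candidates)]
score = evaluate(mine)

# Gather all (score, candidate) pairs on rank 0 and report the best.
results = comm.gather((score, mine), root=0)
if rank == 0:
    print(max(results, key=lambda r: r[0]))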
You can expect further speedups by tuning additional parameters; refer to Megatron-LM's performance-related options.
Options to Consider First
When aiming for performance improvements, consider changing the following parameters first:
- tensor-model-parallel-size: Degree of tensor parallelism
- pipeline-model-parallel-size: Degree of pipeline parallelism
- global-batch-size: Number of samples consumed in one training iteration
- micro-batch-size: Number of samples consumed by one GPU at a time
- seq-length: Sequence length trained in one sample
Since these parameters have a significant impact on system performance and training stability, set them with the following points in mind (a small helper for checking the required conditions is sketched after the list):
- Required conditions
  - Total GPU count (= node count × 8) mod (tensor-model-parallel-size × pipeline-model-parallel-size) = 0
  - global-batch-size mod (micro-batch-size × total GPU count / tensor-model-parallel-size / pipeline-model-parallel-size) = 0
- Adjustment items
  - Reduce tensor-model-parallel-size × pipeline-model-parallel-size within the memory usage limit
  - Increase micro-batch-size × seq-length within the memory usage limit
  - Increase global-batch-size within the range that does not worsen training accuracy
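The two required conditions are mechanical divisibility checks, so they can be validated before launching a run. A small sketch follows; the function name and the default of 8 GPUs per node come from the text above, and nothing here is a Megatron-LM API.

def check_parallel_config(nodes, tp, pp, micro_bs, global_bs, gpus_per_node=8):
    """Validate the two required conditions from the list above."""
    total_gpus = nodes * gpus_per_node
    if total_gpus % (tp * pp) != 0:
        raise ValueError("total GPU count must be divisible by tp * pp")
    data_parallel = total_gpus // (tp * pp)
    if global_bs % (micro_bs * data_parallel) != 0:
        raise ValueError("global-batch-size must be divisible by "
                         "micro-batch-size * data-parallel size")
    return data_parallel

# Example: 2 nodes (16 GPUs), tp=2, pp=2 -> data-parallel size 4;
# with micro-batch-size 2, global-batch-size must be a multiple of 8.
print(check_parallel_config(nodes=2, tp=2, pp=2, micro_bs=2, global_bs=512))  # 4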
Additionally, using the following options may also improve performance:
- --overlap-param-gather: Reduces GPU idle time by overlapping the communication that exchanges parameters between GPUs with computation.
Options Not Recommended for Change
Disabling at least the following options is likely to degrade processing performance, so we recommend keeping the default values used in the how-to guide on distributed parallel pre-training with Megatron-LM.
- --use-distributed-optimizer: Stores optimizer state distributed across GPUs, corresponding to the parameter shards each GPU computes. This improves computational efficiency; in particular, significant speedups can be expected when combined with the bf16 format. See the reference for details.
- --overlap-grad-reduce: Overlaps gradient-aggregation communication with computation, improving effective performance by hiding communication time behind computation. See the reference for details.
- --sequence-parallel: Uses GPU compute resources efficiently by also parallelizing along the sequence dimension. In particular, it parallelizes the layer norm and dropout layers, where tensor parallelism cannot be applied. See the Megatron-LM paper for details.
MMEngine
Using CommandOutputTuner, we set hyperparameters via the training script's --cfg-options option and optimize the execution time for 100 iterations reported in standard output.
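For reference, --cfg-options takes key=value overrides of dotted config paths. Here is a small sketch of turning a hyperparameter dict into such a command; the config keys and the script name train.py are illustrative assumptions, not taken from the example script.

# Hypothetical hyperparameters expressed as dotted MMEngine config paths.
hparams = {
    "optim_wrapper.optimizer.lr": 0.01,
    "train_dataloader.batch_size": 128,
}
overrides = " ".join(f"{k}={v}" for k, v in hparams.items())
# "train.py" is a placeholder for the actual training entry point.
command = f"python train.py --cfg-options {overrides}"
print(command)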
# Move to MMEngine examples directory
cd faib/intelligence/zenith-tune/examples/mmengine/
# Create Python virtual environment
python3 -m venv .venv
. .venv/bin/activate
pip install torch torchvision mmengine
# Install ZenithTune
pip install ../../
# Download CIFAR-10 dataset
mkdir -p data/cifar10
pushd data/cifar10/
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xzf cifar-10-python.tar.gz
popd
# Execute tuning with multiple processes using torchrun
torchrun --nproc-per-node 4 tune_mmengine.py