Tuning for Distributed Learning Frameworks
This document presents tuning examples for frameworks widely used in distributed learning.
It uses tuning scripts from AIBooster Examples to tune the DeepSpeed and Megatron-LM training code included in MDK (Model Development Kit). As a prerequisite, clone the mdk and aibooster-examples repositories.
DeepSpeed
Using CommandOutputTuner, we rewrite the DeepSpeed configuration file according to each trial's hyperparameters, launch distributed training with the deepspeed command, and minimize the train_runtime reported in the standard output.
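The per-trial mechanics can be sketched independently of the tuner itself. In this sketch the helper names, the chosen DeepSpeed config keys, and the assumed `train_runtime = <seconds>` log line are all illustrative assumptions, not ZenithTune's actual API:

```python
import json
import re


def write_ds_config(params, path):
    """Write a DeepSpeed config file reflecting one trial's hyperparameters.

    The tuned keys shown here (micro-batch size, gradient accumulation,
    ZeRO stage) are illustrative choices, not a prescribed set.
    """
    config = {
        "train_micro_batch_size_per_gpu": params["micro_batch_size"],
        "gradient_accumulation_steps": params["grad_accum_steps"],
        "zero_optimization": {"stage": params["zero_stage"]},
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


def parse_train_runtime(stdout):
    """Extract train_runtime (seconds) from captured training output.

    Assumes a line like "train_runtime = 123.45"; adjust the pattern to
    whatever your trainer actually prints.
    """
    match = re.search(r"train_runtime\s*[=:]\s*([0-9.]+)", stdout)
    if match is None:
        raise ValueError("train_runtime not found in output")
    return float(match.group(1))
```

A tuner then repeats this pair per trial: write the config, run the deepspeed command, and minimize the parsed runtime.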
This example assumes that you have completed MDK's step-by-step.md.
# Move to MDK directory
cd mdk/
# Create Python virtual environment
python3 -m venv .venv
. .venv/bin/activate
# Install AIBooster, which provides ZenithTune
pip install aibooster
# Copy DeepSpeed tuning script to current directory
cp <aibooster-examples repository>/intelligence/zenith_tune/frameworks/deepspeed/* .
# Execute tuning with a single process
python tune_deepspeed.py
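Conceptually, a script like tune_deepspeed.py runs a loop of the following shape. This is a minimal stand-alone sketch using plain random search from the standard library, not ZenithTune's actual API, and the dummy command below stands in for a real deepspeed launch:

```python
import random
import re
import subprocess
import sys


def run_trial(command):
    """Run one training command and return train_runtime parsed from stdout.

    Assumes the output contains a line like "train_runtime = 12.3".
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return float(re.search(r"train_runtime\s*[=:]\s*([0-9.]+)",
                           result.stdout).group(1))


def tune(n_trials=3):
    """Random-search sketch: sample hyperparameters, run, keep the fastest."""
    best_runtime, best_batch = float("inf"), None
    for _ in range(n_trials):
        micro_batch = random.choice([1, 2, 4, 8])
        # In practice this step would rewrite the DeepSpeed config and launch
        # something like ["deepspeed", "train.py", "--deepspeed", "ds_config.json"].
        # A dummy command that just prints a runtime keeps the sketch runnable.
        command = [sys.executable, "-c",
                   f"print('train_runtime = {100 / micro_batch}')"]
        runtime = run_trial(command)
        if runtime < best_runtime:
            best_runtime, best_batch = runtime, micro_batch
    return best_runtime, best_batch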