# NCCL Communication Pattern Analyzer

This tool analyzes the NCCL communication patterns an application uses during execution, based on the NCCL debug log it produces.
## Usage

```
# 1. Run the application with NCCL logging enabled and save the log output
$ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL TORCH_DISTRIBUTED_DEBUG=DETAIL /path/to/application_you_want_to_analyze | tee out.log

# 2. Analyze the pattern from the saved log
$ cd <directory where FAIB package was cloned>/observability/nccl/pattern_analyzer
$ python nccl_pattern_analyzer.py /path/to/out.log
routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allreduce,1,float32,8,Ring,LL,4,15.00005
allreduce,262144,float32,8,Ring,LL,1048576,28.01424
allreduce,524288,float32,8,Nvls,Simple,2097152,32.864319
allreduce,1048576,float32,8,Nvls,Simple,4194304,40.728642
allreduce,2097152,float32,8,Nvls,Simple,8388608,56.457283
allreduce,4194304,float32,8,Nvls,Simple,16777216,87.914566
allreduce,8388608,float32,8,Nvls,Simple,33554432,150.829132
allreduce,16777216,float32,8,Nvls,Simple,67108864,276.658264
allreduce,33554432,float32,8,Nvls,Simple,134217728,528.316528
allreduce,67108864,float32,8,Nvls,Simple,268435456,1031.633057
allreduce,134217728,float32,8,Nvls,Simple,536870912,2038.265991
allreduce,268435456,float32,8,Nvls,Simple,1073741824,4051.531982
allreduce,536870912,float32,8,Nvls,Simple,2147483648,8078.063965
allreduce,1073741824,float32,8,Nvls,Simple,4294967296,16131.12793
allreduce,2147483648,float32,8,Nvls,Simple,8589934592,32237.255859
allreduce,4294967296,float32,8,Nvls,Simple,17179869184,64449.511719
```
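Because the analyzer emits plain CSV, its output is easy to post-process with standard tools. As an illustrative sketch (not part of the tool), the following reads rows like those above with Python's `csv` module and estimates the effective data rate of each call from the `TOTALBYTES` and time columns; the sample rows are copied from the output above:

```python
import csv
import io

# Two rows copied from the analyzer output above (in practice, read the
# file produced by redirecting the analyzer's stdout).
csv_text = """routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allreduce,16777216,float32,8,Nvls,Simple,67108864,276.658264
allreduce,33554432,float32,8,Nvls,Simple,134217728,528.316528
"""

# Effective data rate per call: bytes moved divided by the measured time.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    total_bytes = int(row["TOTALBYTES"])
    time_us = float(row["TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC"])
    gb_per_s = total_bytes / (time_us * 1e-6) / 1e9
    print(f"{row['routine']}: {total_bytes} bytes in {time_us:.1f} us -> {gb_per_s:.1f} GB/s")
```

Note that this is the aggregate data rate seen by the collective, not the NCCL bus bandwidth, which depends on the routine and the number of ranks.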
## Explanation of Analysis Results

The analysis results are written to standard output in CSV format, as shown below:

```
routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allreduce,8388608,float32,8,Nvls,Simple,33554432,150.829132
allreduce,16777216,float32,8,Nvls,Simple,67108864,276.658264
allreduce,33554432,float32,8,Nvls,Simple,134217728,528.316528
```
Each row describes one NCCL communication pattern used by the executed application. For example:

```
allreduce,16777216,float32,8,Nvls,Simple,67108864,276.658264
```
This row means the following:

- The communication routine was `allreduce` (all-reduce)
- The number of communicated elements was `16777216`
- The data type used was `float32`
- The number of GPUs involved in the communication was `8`
- The communication algorithm used was `NVLS`
- The protocol used was `Simple`
- The data length communicated was `67108864` bytes
- The estimated time taken for one communication call was `276.658264` microseconds. Note that this value may not be extractable depending on the NCCL version being used.
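As a cross-check, the `TOTALBYTES` column is simply `num_samples` multiplied by the element size of the data type. A minimal Python check of the example row (the `DTYPE_BYTES` table is our own, covering the types that appear in this document):

```python
# Element sizes in bytes for the data types seen in the analyzer output.
DTYPE_BYTES = {"uint8": 1, "float16": 2, "float32": 4, "float64": 8, "int64": 8}

num_samples = 16777216   # num_samples column of the example row
datatype = "float32"     # datatype column
total_bytes = num_samples * DTYPE_BYTES[datatype]
print(total_bytes)  # 67108864, matching the TOTALBYTES column
```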
## Example Usage

As a concrete example, we explain how to analyze the NCCL communication pattern that occurs during single-node pre-training with Megatron-LM from MDK.
### 0. Confirm standard execution

First, without any modifications, confirm that you can run single-node execution by following the procedure described in the how-to guide.

```
mpirun /opt/singularity/bin/singularity exec --nv ./pytorch.sif bash examples/gpt3/train_gpt3_7b_mpi.sh ./checkpoint_1node ./tensorboard ./gpt2/vocab.json ./gpt2/merges.txt ./arxiv_text_document
```
### 1. Execute with NCCL logging

Next, following the procedure at the beginning, run the same command with the logging options `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL TORCH_DISTRIBUTED_DEBUG=DETAIL` and save the standard output using `| tee out.log`.

```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL TORCH_DISTRIBUTED_DEBUG=DETAIL mpirun /opt/singularity/bin/singularity exec --nv ./pytorch.sif bash examples/gpt3/train_gpt3_7b_mpi.sh ./checkpoint_1node ./tensorboard ./gpt2/vocab.json ./gpt2/merges.txt ./arxiv_text_document | tee out.log
```
Wait for the execution to complete; a large amount of NCCL log output will be displayed during training.

Note that the processing is essentially the same across all training iterations, so you may reduce `--train-iters` (or other parameters) in the training script. Here, we run with a limit of 20 iterations and proceed to the next step.
### 2. Analyze

Return to this directory and run the analysis.

```
$ cd <directory where FAIB package was cloned>/observability/nccl/pattern_analyzer
$ python nccl_pattern_analyzer.py ../../../mdk/outputs/Megatron-LM/out.log
routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allgather,1,float32,8,Ring,LL,32,10.800199
allgather,2,float32,8,Ring,LL,64,10.800397
allgather,24,float32,8,Ring,LL,768,10.804766
allgather,5571584,float16,8,Ring,Simple,89145344,248.872696
allgather,8389632,float16,8,Ring,Simple,134234112,358.46347
allgather,11141120,float16,8,Ring,Simple,178257920,465.465759
allgather,25755648,float16,8,Ring,Simple,412090368,1033.808472
allgather,27853824,float16,8,Ring,Simple,445661184,1115.404175
allreduce,1,uint8,8,Ring,LL,1,15.000012
allreduce,1,float32,8,Ring,LL,4,15.00005
allreduce,1,float64,8,Ring,LL,8,15.000099
allreduce,2,float32,8,Ring,LL,8,15.000099
broadcast,3,int64,8,Ring,LL,24,7.20017
reducescatter,5571584,float16,8,Ring,Simple,89145344,248.872696
reducescatter,8389632,float16,8,Ring,Simple,134234112,358.46347
reducescatter,11141120,float16,8,Ring,Simple,178257920,465.465759
reducescatter,25755648,float16,8,Ring,Simple,412090368,1033.808472
reducescatter,27853824,float16,8,Ring,Simple,445661184,1115.404175
```
### 3. Discussion of Results

From the above results, you can see the following:

- Four communication routines appear: allgather, allreduce, broadcast, and reducescatter
- allgather and reducescatter perform large communications, up to approximately 400 MB per call
- allreduce and broadcast involve only small communications, so they do not need to be considered here
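Such conclusions can be made quantitative by summing `TOTALBYTES` per routine over the CSV rows. A minimal sketch, using a few rows taken from the analysis output above:

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the analysis output above.
csv_text = """routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allgather,27853824,float16,8,Ring,Simple,445661184,1115.404175
allreduce,1,float32,8,Ring,LL,4,15.00005
broadcast,3,int64,8,Ring,LL,24,7.20017
reducescatter,27853824,float16,8,Ring,Simple,445661184,1115.404175
"""

# Sum the communicated data volume per routine to see which dominate.
bytes_per_routine = defaultdict(int)
for row in csv.DictReader(io.StringIO(csv_text)):
    bytes_per_routine[row["routine"]] += int(row["TOTALBYTES"])

for routine, total in sorted(bytes_per_routine.items(), key=lambda kv: -kv[1]):
    print(f"{routine:13s} {total / 1e6:10.1f} MB")
```

Run against the full CSV, this immediately shows allgather and reducescatter dominating the data volume while allreduce and broadcast stay in the byte range.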