PyTorch NCCL Benchmark
This tool measures NCCL communication bandwidth as seen through PyTorch.
Its functionality is similar to NCCL Tests, but by going through PyTorch, you can obtain values closer to the actual communication bandwidth in AI processing, thanks to the following advantages:
- Includes the actual overhead when called from PyTorch
- Uses the NCCL that the actual PyTorch environment loads, which may differ from the NCCL installed on the system.
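To check which NCCL build the active PyTorch environment reports, a small sketch like the following may help. It relies on `torch.cuda.nccl.version()`; the helper name `describe_torch_nccl` is hypothetical and is not part of this tool.

```python
def describe_torch_nccl() -> str:
    """Return a one-line summary of the NCCL reported by the active PyTorch."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    try:
        # Recent PyTorch releases return a tuple such as (major, minor, patch).
        version = torch.cuda.nccl.version()
    except Exception as exc:  # for example, a CPU-only build without NCCL
        return f"torch {torch.__version__}: NCCL version unavailable ({exc})"
    return f"torch {torch.__version__}: NCCL {version}"


if __name__ == "__main__":
    print(describe_torch_nccl())
```

Running this inside the activated environment, before the benchmark, makes it easier to tell later which NCCL version a given result belongs to.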
Environment Setup
Please activate the environment that contains the PyTorch installation you want to benchmark. For example, if you are using the MDK environment, activate the MDK venv environment according to the tutorial steps.
If there is no specific environment you need to target, please create a new environment dedicated to this benchmark as shown below.
python3 -m venv .venv
. .venv/bin/activate
pip install torch==2.3.1
With either environment activated, please additionally install the packages required for this benchmark into that environment.
pip show mpi4py &> /dev/null || pip install mpi4py==4.0.3 # Use mpi4py if it exists in the environment; otherwise, install the recommended version
Execution Method
For example, to run AllGather communication with the Float32 data type over sizes from 1 [MB] to 8 [GB], doubling the size at each step, use a command like the following:
cd <directory where FAIB package was cloned>/observability/nccl/benchmark_torch
mpirun python run.py --routine allgather -b 1M -e 8G -f 2 -d float32
# For multi-node execution, activate the venv inside the launched command so that every rank uses it
mpirun --np 16 --hostfile /opt/etc/host/hostfile.txt bash -c '[ -n "$VIRTUAL_ENV" ] && . $VIRTUAL_ENV/bin/activate; python run.py --routine allgather -b 1M -e 8G -f 2 -d float32'
The result is written to standard output in CSV format, as shown below. The bandwidths (algbw, busbw) are slightly lower than the pure communication bandwidth because the measurements include the overhead of calling the communication functions via PyTorch.
size(B), count(elements), type, redop, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s)
1048576, 32768, float32, sum, 175.71449279785156, 5.9674986582143825, 5.221561325937585
2097152, 65536, float32, sum, 104.18891906738281, 20.128359318553777, 17.612314403734555
4194304, 131072, float32, sum, 103.19948196411133, 40.64268463536098, 35.56234905594086
8388608, 262144, float32, sum, 107.76519775390625, 77.8415311699823, 68.11133977373451
16777216, 524288, float32, sum, 118.21985244750977, 141.91538605962288, 124.17596280217002
33554432, 1048576, float32, sum, 159.44242477416992, 210.44858071824748, 184.14250812846655
67108864, 2097152, float32, sum, 265.8724784851074, 252.40996880299153, 220.8587227026176
134217728, 4194304, float32, sum, 423.8605499267578, 316.6553906071054, 277.07346678121723
268435456, 8388608, float32, sum, 748.9204406738281, 358.42986974488224, 313.626136026772
536870912, 16777216, float32, sum, 1409.5067977905273, 380.89274407301343, 333.28115106388674
1073741824, 33554432, float32, sum, 2731.4066886901855, 393.1094656998517, 343.97078248737023
2147483648, 67108864, float32, sum, 5328.118801116943, 403.0472532913153, 352.6663466299009
4294967296, 134217728, float32, sum, 10507.237911224365, 408.7627340589574, 357.6673923015877
8589934592, 268435456, float32, sum, 20875.239372253418, 411.4891541515648, 360.05300988261916
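The relationship between the time, algbw, and busbw columns can be sketched as follows. This assumes the NCCL Tests convention for the bus bandwidth correction factors and assumes the sample above was taken with 8 ranks (the rank count is not stated in the output; it is inferred from the busbw/algbw ratio). The function names are hypothetical and are not taken from run.py.

```python
# Correction factors from algbw to busbw, following the NCCL Tests convention.
BUS_FACTOR = {
    "allreduce": lambda n: 2 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "reducescatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes divided by elapsed seconds."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(routine: str, size_bytes: int, time_us: float, nranks: int) -> float:
    """Bus bandwidth in GB/s: algbw scaled by the routine-specific factor."""
    return algbw_gbps(size_bytes, time_us) * BUS_FACTOR[routine](nranks)


if __name__ == "__main__":
    # First row of the sample output above; nranks=8 is an assumption.
    size, time_us = 1048576, 175.71449279785156
    print(algbw_gbps(size, time_us))                  # about 5.97 GB/s
    print(busbw_gbps("allgather", size, time_us, 8))  # about 5.22 GB/s
```

Because busbw normalizes for the number of ranks, it is usually the more convenient figure to compare against hardware link bandwidth.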
The descriptions of each argument are as follows:
- --routine: The NCCL communication routine to execute. Currently, only the following are supported:
  - allreduce: AllReduce
  - allgather: AllGather
  - reducescatter: ReduceScatter
  - broadcast: Broadcast
- -b: Minimum data size to transfer [bytes]
- -e: Maximum data size to transfer [bytes]
- -f: Factor by which the data size is multiplied at each step from the minimum to the maximum
- -d: Data type to use
The arguments, processing content, and output order of the results are designed to be as close as possible to the original NCCL Tests.
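How the -b, -e, and -f arguments expand into the list of measured sizes can be illustrated with the sketch below. It mirrors the sample output (1M = 1048576 bytes, 8G = 8589934592 bytes); the actual parsing in run.py is not shown in this document, so `parse_size` and `sweep_sizes` are hypothetical helpers written under that assumption.

```python
# Binary suffixes, matching the sizes printed in the sample output.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as '1M' or '8G' into a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def sweep_sizes(begin: str, end: str, factor: int) -> list[int]:
    """List every size from begin to end, multiplying by factor each step."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    sizes = []
    size, limit = parse_size(begin), parse_size(end)
    while size <= limit:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    for size in sweep_sizes("1M", "8G", 2):
        print(size)
```

With `-b 1M -e 8G -f 2`, this yields 14 sizes, one per row of the sample output.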
How to Run with a Specific Communication Pattern
If you want to run the benchmark with a specific communication pattern, for example, using the output of the NCCL Communication Pattern Analyzer, you can execute as follows:
cd <directory where FAIB package was cloned>/observability/nccl/benchmark_torch
mpirun python run.py --pattern pattern.csv
Here, pattern.csv is a CSV file in the following format, and you can directly use the standard output of the Communication Pattern Analyzer.
routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allgather,1,float32,8,Ring,LL,32,10.800199
allgather,2,float32,8,Ring,LL,64,10.800397
allgather,24,float32,8,Ring,LL,768,10.804766
allgather,5571584,float16,8,Ring,Simple,89145344,248.872696
allgather,8389632,float16,8,Ring,Simple,134234112,358.46347
allgather,11141120,float16,8,Ring,Simple,178257920,465.465759
allgather,25755648,float16,8,Ring,Simple,412090368,1033.808472
allgather,27853824,float16,8,Ring,Simple,445661184,1115.404175
allreduce,1,uint8,8,Ring,LL,1,15.000012
allreduce,1,float32,8,Ring,LL,4,15.00005
allreduce,1,float64,8,Ring,LL,8,15.000099
allreduce,2,float32,8,Ring,LL,8,15.000099
broadcast,3,int64,8,Ring,LL,24,7.20017
reducescatter,5571584,float16,8,Ring,Simple,89145344,248.872696
reducescatter,8389632,float16,8,Ring,Simple,134234112,358.46347
reducescatter,11141120,float16,8,Ring,Simple,178257920,465.465759
reducescatter,25755648,float16,8,Ring,Simple,412090368,1033.808472
reducescatter,27853824,float16,8,Ring,Simple,445661184,1115.404175
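For inspecting or filtering a pattern file before passing it to the benchmark, a reader along these lines could be used. It is a sketch based only on the column names shown above; `load_pattern` and the dictionary keys are hypothetical names, and the two sample rows are copied from the example.

```python
import csv
import io

SAMPLE = """routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allgather,1,float32,8,Ring,LL,32,10.800199
allreduce,1,uint8,8,Ring,LL,1,15.000012
"""


def load_pattern(stream) -> list[dict]:
    """Parse pattern CSV rows into dictionaries with typed values."""
    rows = []
    for rec in csv.DictReader(stream):
        rows.append({
            "routine": rec["routine"],
            "num_samples": int(rec["num_samples"]),
            "datatype": rec["datatype"],
            "nranks": int(rec["nranks"]),
            "algo": rec["ALGO"],
            "protocol": rec["PROTOCOL"],
            "total_bytes": int(rec["TOTALBYTES"]),
            "recorded_time_us": float(rec["TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC"]),
        })
    return rows


if __name__ == "__main__":
    for row in load_pattern(io.StringIO(SAMPLE)):
        print(row["routine"], row["datatype"], row["total_bytes"])
```

To read a real file, pass `open("pattern.csv", newline="")` in place of the `io.StringIO` sample.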
When executed, the measured time and bandwidths (algbw, busbw) are appended as extra columns to each row of the CSV above, and the result is written to standard output.
routine, num_samples, datatype, nranks, ALGO, PROTOCOL, TOTALBYTES, TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s)
allgather, 1, float32, 8, Default, Default, 32, 10.800199, 170.27854919433594, 0.0001879273704844581, 0.00016443644917390086
allgather, 2, float32, 8, Default, Default, 64, 10.800397, 99.61128234863281, 0.0006424975011967448, 0.0005621853135471518
allgather, 24, float32, 8, Default, Default, 768, 10.804766, 101.5782356262207, 0.007560674737706842, 0.006615590395493486
allgather, 5571584, float16, 8, Default, Default, 89145344, 248.872696, 307.2500228881836, 290.1394218364057, 253.87199410685497
allgather, 8389632, float16, 8, Default, Default, 134234112, 358.46347, 421.6194152832031, 318.3774445250215, 278.5802639593938
allgather, 11141120, float16, 8, Default, Default, 178257920, 465.465759, 532.0906639099121, 335.0141847822023, 293.137411684427
allgather, 25755648, float16, 8, Default, Default, 412090368, 1033.808472, 1138.6990547180176, 361.8957671849901, 316.65879628686633
allgather, 27853824, float16, 8, Default, Default, 445661184, 1115.404175, 1206.1238288879395, 369.4986976676391, 323.3113604591842
allreduce, 1, uint8, 8, Default, Default, 1, 15.000012, 97.37014770507812, 1.0270088148873653e-05, 1.7972654260528892e-05
allreduce, 1, float32, 8, Default, Default, 4, 15.00005, 95.33166885375977, 4.195877454045267e-05, 7.342785544579218e-05
allreduce, 1, float64, 8, Default, Default, 8, 15.000099, 77.40259170532227, 0.00010335571230555983, 0.0001808724965347297
allreduce, 2, float32, 8, Default, Default, 8, 15.000099, 81.22920989990234, 9.848673906662752e-05, 0.00017235179336659817
broadcast, 3, int64, 8, Default, Default, 24, 7.20017, 75.26874542236328, 0.0003188574469433006, 0.0003188574469433006
reducescatter, 5571584, float16, 8, Default, Default, 89145344, 248.872696, 299.1795539855957, 297.9660301395195, 260.7202763720795
reducescatter, 8389632, float16, 8, Default, Default, 134234112, 358.46347, 419.4021224975586, 320.0606406105668, 280.05306053424596
reducescatter, 11141120, float16, 8, Default, Default, 178257920, 465.465759, 520.8611488342285, 342.2369289761197, 299.45731285410477
reducescatter, 25755648, float16, 8, Default, Default, 412090368, 1033.808472, 1134.0022087097168, 363.3946784538295, 317.97034364710083
reducescatter, 27853824, float16, 8, Default, Default, 445661184, 1115.404175, 1190.352439880371, 374.39431303622007, 327.5950239066926
Note that here, the algorithm (ALGO) and protocol (PROTOCOL) are overwritten with Default.
Due to NCCL constraints, the algorithm and protocol cannot be controlled from outside for each individual communication call within a single process. Therefore, when run as-is, the benchmark cannot apply the methods listed in the CSV file.
If necessary, you can specify the method to be used for the entire process by explicitly setting the NCCL_ALGO and NCCL_PROTO environment variables as shown below.
cd <directory where FAIB package was cloned>/observability/nccl/benchmark_torch
$ mpirun -x NCCL_ALGO=Ring -x NCCL_PROTO=Simple python run.py --pattern pattern.csv
routine, num_samples, datatype, nranks, ALGO, PROTOCOL, TOTALBYTES, TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s)
allgather, 1, float32, 8, Ring, Simple, 32, 10.800199, 170.44544219970703, 0.000187743359910477, 0.00016427543992166737
allgather, 2, float32, 8, Ring, Simple, 64, 10.800397, 102.38885879516602, 0.0006250680079171033, 0.0005469345069274654
allgather, 24, float32, 8, Ring, Simple, 768, 10.804766, 99.79009628295898, 0.007696154514394935, 0.006734135200095568
allgather, 5571584, float16, 8, Ring, Simple, 89145344, 248.872696, 305.5572509765625, 291.7467797445193, 255.2784322764544
allgather, 8389632, float16, 8, Ring, Simple, 134234112, 358.46347, 420.61805725097656, 319.13540012359596, 279.24347510814647
allgather, 11141120, float16, 8, Ring, Simple, 178257920, 465.465759, 537.5146865844727, 331.63358034494564, 290.17938280182744
allgather, 25755648, float16, 8, Ring, Simple, 412090368, 1033.808472, 1144.8264122009277, 359.9588231090482, 314.96397022041714
allgather, 27853824, float16, 8, Ring, Simple, 445661184, 1115.404175, 1199.8295783996582, 371.4370707500196, 325.0074369062671
allreduce, 1, uint8, 8, Ring, Simple, 1, 15.000012, 75.01840591430664, 1.3330061973621484e-05, 2.3327608453837595e-05
allreduce, 1, float32, 8, Ring, Simple, 4, 15.00005, 74.41043853759766, 5.3755898750400514e-05, 9.40728228132009e-05
allreduce, 1, float64, 8, Ring, Simple, 8, 15.000099, 75.18529891967773, 0.00010640377992706517, 0.00018620661487236406
allreduce, 2, float32, 8, Ring, Simple, 8, 15.000099, 70.16658782958984, 0.00011401437988447163, 0.00019952516479782535
broadcast, 3, int64, 8, Ring, Simple, 24, 7.20017, 58.6390495300293, 0.0004092835779630006, 0.0004092835779630006
reducescatter, 5571584, float16, 8, Ring, Simple, 89145344, 248.872696, 291.58592224121094, 305.7258159612232, 267.51008896607027
reducescatter, 8389632, float16, 8, Ring, Simple, 134234112, 358.46347, 404.00028228759766, 332.2624213030676, 290.7296186401841
reducescatter, 11141120, float16, 8, Ring, Simple, 178257920, 465.465759, 522.2082138061523, 341.3541098879971, 298.6848461519975
reducescatter, 25755648, float16, 8, Ring, Simple, 412090368, 1033.808472, 1136.6844177246094, 362.5371840892424, 317.2200360780871
reducescatter, 27853824, float16, 8, Ring, Simple, 445661184, 1115.404175, 1198.4825134277344, 371.8545569141275, 325.37273729986157
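To compare the time measured through PyTorch with the time recorded in the pattern file, the output can be post-processed roughly as follows. This is a sketch that assumes the output keeps the header and the ", " separators shown above; `time_ratios` is a hypothetical helper, and the two sample rows are copied from the output above.

```python
import csv
import io

SAMPLE_OUTPUT = """routine, num_samples, datatype, nranks, ALGO, PROTOCOL, TOTALBYTES, TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s)
allgather, 5571584, float16, 8, Ring, Simple, 89145344, 248.872696, 305.5572509765625, 291.7467797445193, 255.2784322764544
broadcast, 3, int64, 8, Ring, Simple, 24, 7.20017, 58.6390495300293, 0.0004092835779630006, 0.0004092835779630006
"""


def time_ratios(stream) -> list[tuple[str, int, float]]:
    """Return (routine, total bytes, measured time / recorded time) per row."""
    # skipinitialspace drops the blank that follows each comma in the output.
    reader = csv.DictReader(stream, skipinitialspace=True)
    ratios = []
    for rec in reader:
        recorded = float(rec["TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC"])
        measured = float(rec["in-place time(us)"])
        ratios.append((rec["routine"], int(rec["TOTALBYTES"]), measured / recorded))
    return ratios


if __name__ == "__main__":
    for routine, total_bytes, ratio in time_ratios(io.StringIO(SAMPLE_OUTPUT)):
        print(f"{routine:>14} {total_bytes:>12} B  measured/recorded = {ratio:.2f}")
```

In the two sample rows, the ratio is much larger for the small message than for the large one.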