Version: v2602

PyTorch NCCL Benchmark

This tool measures the actual communication bandwidth of NCCL. To get started, install it with pip:

pip install aibooster

Its functionality is similar to NCCL Tests, but because it goes through PyTorch, it yields values closer to the communication bandwidth seen in real AI workloads, thanks to the following advantages:

  • It includes the actual overhead incurred when NCCL is called from PyTorch.
  • It uses the NCCL that the actual PyTorch environment uses, not just the NCCL installed on the system.

The program also outputs the efficiency of the measured communication bandwidth relative to the ideal communication bandwidth. If the GPUs are not supported for estimating the ideal communication bandwidth in a single-node environment, the program outputs N/A instead; likewise in a multi-node environment if the GPUs or the internode communication interfaces are not supported. Refer to the document for details on how the ideal communication bandwidth is estimated.
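As a rough illustration of the reported efficiency (a minimal sketch; the function name is illustrative and not part of the tool), the efficiency column is the measured bus bandwidth as a percentage of the estimated ideal bus bandwidth:

```python
# Sketch: busbw efficiency is the measured bus bandwidth divided by the
# estimated ideal bus bandwidth, expressed as a percentage.

def busbw_efficiency(measured_busbw_gbps: float, ideal_busbw_gbps: float) -> float:
    """Return measured bus bandwidth as a percentage of the ideal value."""
    return measured_busbw_gbps / ideal_busbw_gbps * 100.0

# Figures of the same shape as the tool's output: 360.47 GB/s measured
# against a 450 GB/s ideal is roughly 80.1 %.
print(round(busbw_efficiency(360.47, 450.0), 1))
```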

Environment Setup

Please activate the environment that contains the PyTorch installation you want to benchmark. For example, if you are using the MDK environment, activate the MDK venv environment according to the tutorial steps.

If there is no specific environment you need to target, please create a new environment dedicated to this benchmark as shown below.

python3 -m venv .venv
. .venv/bin/activate
pip install torch==2.3.1

With either environment activated, please additionally install the packages required for this benchmark into that environment.

pip show mpi4py &> /dev/null || pip install mpi4py==4.0.3  # use mpi4py if it already exists in the environment; otherwise install the recommended version
pip install packaging 'nvidia-ml-py==12.570.86'  # install packaging and pyNVML (nvidia-ml-py, https://pypi.org/project/nvidia-ml-py/) 12.570.86
pip show pynvml &> /dev/null && pip install 'pynvml==12.0.0'  # pin pynvml (https://pypi.org/project/pynvml/) to 12.0.0 only if it already exists in the environment; otherwise do nothing

The program requires the specific versions of the nvidia-ml-py and pynvml packages listed above and will not work with other versions. Please refer to the source code for details.
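Since mismatched versions are a common failure mode, it can help to verify the pins before running. A minimal sketch (the helper name is hypothetical; the version strings mirror the install commands above):

```python
# Sketch: report installed packages whose versions differ from the pins
# this benchmark expects. Absent packages are skipped, since pynvml only
# needs pinning when it is already present.
from importlib.metadata import PackageNotFoundError, version

def check_versions(expected):
    """Return (package, found, wanted) tuples for installed-but-mismatched packages."""
    mismatches = []
    for pkg, want in expected.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            continue  # package absent; nothing to check
        if got != want:
            mismatches.append((pkg, got, want))
    return mismatches

print(check_versions({"nvidia-ml-py": "12.570.86", "pynvml": "12.0.0"}))
```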

Execution Method

For example, to benchmark AllGather communication with the Float32 data type over message sizes from 1 MB to 8 GB, doubling at each step, use a command like the following:

mpirun python -m aibooster.observability.nccl_benchmark_torch.run --routine allgather -b 1M -e 8G -f 2 -d float32

# For multi-node execution, re-activate the inherited venv on every rank
mpirun --np 16 --hostfile /opt/etc/host/hostfile.txt bash -c '[ -n "$VIRTUAL_ENV" ] && . $VIRTUAL_ENV/bin/activate; python -m aibooster.observability.nccl_benchmark_torch.run --routine allgather -b 1M -e 8G -f 2 -d float32'
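For reference, a typical Open MPI hostfile lists one host per line with a slot count. A minimal sketch matching the `--np 16` example above (the hostnames are hypothetical):

```text
# Hypothetical Open MPI hostfile: two hosts with 8 slots (e.g. GPUs) each,
# providing the 16 ranks requested by --np 16.
node001 slots=8
node002 slots=8
```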

As a result, you will get CSV-formatted standard output like the following. These bandwidths (algbw, busbw) will be slightly lower than the pure communication bandwidth, because they are measured including the overhead of calling the communication functions via PyTorch.

size(B), count(elements), type, redop, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s), in-place ideal busbw(GB/s), in-place busbw efficiency(%)
1048576, 32768, float32, sum, 177.3238182067871, 5.913339846862521, 5.174172366004706, 450.0, 1.149816081334379
2097152, 65536, float32, sum, 105.70287704467773, 19.840065461166123, 17.360057278520358, 450.0, 3.857790506337857
4194304, 131072, float32, sum, 103.64055633544922, 40.469717148415, 35.41100250486313, 450.0, 7.869111667747362
8388608, 262144, float32, sum, 103.15179824829102, 81.32294484879695, 71.15757674269733, 450.0, 15.812794831710518
16777216, 524288, float32, sum, 119.85301971435547, 139.98158778130892, 122.4838893086453, 450.0, 27.218642068587844
33554432, 1048576, float32, sum, 165.65322875976562, 202.5582733956937, 177.238489221232, 450.0, 39.38633093805155
67108864, 2097152, float32, sum, 253.17668914794922, 265.0673102087353, 231.9338964326434, 450.0, 51.54086587392075
134217728, 4194304, float32, sum, 426.45931243896484, 314.72575245782525, 275.3850334005971, 450.0, 61.19667408902158
268435456, 8388608, float32, sum, 758.7194442749023, 353.8006809045734, 309.57559579150177, 450.0, 68.79457684255594
536870912, 16777216, float32, sum, 1427.1020889282227, 376.19657077455406, 329.1719994277348, 450.0, 73.1493332061633
1073741824, 33554432, float32, sum, 2745.0919151306152, 391.1496799366406, 342.25596994456055, 450.0, 76.05688220990235
2147483648, 67108864, float32, sum, 5357.1343421936035, 400.8642514499016, 350.7562200186639, 450.0, 77.9458266708142
4294967296, 134217728, float32, sum, 10530.102252960205, 407.8751746966755, 356.8907778595911, 450.0, 79.30906174657581
8589934592, 268435456, float32, sum, 20851.06372833252, 411.96625284531444, 360.4704712396501, 450.0, 80.1045491643667
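Because the output is plain CSV on standard output, it can be captured (e.g. `mpirun ... > result.csv`) and post-processed with standard tooling. A minimal sketch that finds the peak in-place busbw; the two embedded rows are abbreviated from the output above:

```python
# Sketch: parse the benchmark's CSV output and pick the row with the
# highest in-place busbw. skipinitialspace handles the ", "-separated
# header and fields.
import csv
import io

captured = """\
size(B), count(elements), type, redop, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s), in-place ideal busbw(GB/s), in-place busbw efficiency(%)
1048576, 32768, float32, sum, 177.32, 5.91, 5.17, 450.0, 1.15
8589934592, 268435456, float32, sum, 20851.06, 411.97, 360.47, 450.0, 80.10
"""

reader = csv.DictReader(io.StringIO(captured), skipinitialspace=True)
peak = max(reader, key=lambda row: float(row["in-place busbw(GB/s)"]))
print(peak["size(B)"], peak["in-place busbw(GB/s)"])
```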

The descriptions of each argument are as follows:

  • --routine: The NCCL communication routine to execute. Currently, only the following are supported:
    • allreduce: AllReduce
    • allgather: AllGather
    • reducescatter: ReduceScatter
    • broadcast: Broadcast
  • -b: Minimum data size to transfer [byte]
  • -e: Maximum data size to transfer [byte]
  • -f: Multiplicative factor by which the data size increases from the minimum to the maximum
  • -d: Data type to use

The arguments, the processing performed, and the output order of the results are designed to match the original NCCL Tests as closely as possible.
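The size sweep implied by -b/-e/-f can be sketched as follows (the function name is illustrative; the convention of multiplying by the factor until the maximum is exceeded follows NCCL Tests):

```python
# Sketch: expand the -b/-e/-f arguments into the list of message sizes
# that the benchmark iterates over.
def sweep_sizes(min_bytes: int, max_bytes: int, factor: int):
    """Yield message sizes from min_bytes up to max_bytes, scaling by factor."""
    size = min_bytes
    while size <= max_bytes:
        yield size
        size *= factor

sizes = list(sweep_sizes(1 << 20, 8 << 30, 2))  # -b 1M -e 8G -f 2
print(len(sizes), sizes[0], sizes[-1])  # 14 sizes, matching the 14 output rows above
```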

How to Run with a Specific Communication Pattern

If you want to run the benchmark with a specific communication pattern, for example, using the output of the NCCL Communication Pattern Analyzer, you can execute as follows:

mpirun python -m aibooster.observability.nccl_benchmark_torch.run --pattern pattern.csv

Here, pattern.csv is a CSV file in the following format; the standard output of the Communication Pattern Analyzer can be used directly.

routine,num_samples,datatype,nranks,ALGO,PROTOCOL,TOTALBYTES,TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC
allgather,1,float32,8,Ring,LL,32,10.800199
allgather,2,float32,8,Ring,LL,64,10.800397
allgather,24,float32,8,Ring,LL,768,10.804766
allgather,5571584,float16,8,Ring,Simple,89145344,248.872696
allgather,8389632,float16,8,Ring,Simple,134234112,358.46347
allgather,11141120,float16,8,Ring,Simple,178257920,465.465759
allgather,25755648,float16,8,Ring,Simple,412090368,1033.808472
allgather,27853824,float16,8,Ring,Simple,445661184,1115.404175
allreduce,1,uint8,8,Ring,LL,1,15.000012
allreduce,1,float32,8,Ring,LL,4,15.00005
allreduce,1,float64,8,Ring,LL,8,15.000099
allreduce,2,float32,8,Ring,LL,8,15.000099
broadcast,3,int64,8,Ring,LL,24,7.20017
reducescatter,5571584,float16,8,Ring,Simple,89145344,248.872696
reducescatter,8389632,float16,8,Ring,Simple,134234112,358.46347
reducescatter,11141120,float16,8,Ring,Simple,178257920,465.465759
reducescatter,25755648,float16,8,Ring,Simple,412090368,1033.808472
reducescatter,27853824,float16,8,Ring,Simple,445661184,1115.404175
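Judging from the sample rows (this relationship is inferred from the data above, not from the tool's specification), TOTALBYTES appears to be num_samples times the element size, additionally multiplied by nranks for allgather and reducescatter. A sanity-check sketch:

```python
# Sketch, inferred from the sample rows above: reconstruct TOTALBYTES
# from num_samples, the datatype's element size, and (for allgather /
# reducescatter) the rank count.
ELEMENT_SIZE = {"uint8": 1, "float16": 2, "float32": 4, "float64": 8, "int64": 8}

def expected_total_bytes(routine, num_samples, datatype, nranks):
    total = num_samples * ELEMENT_SIZE[datatype]
    if routine in ("allgather", "reducescatter"):
        total *= nranks
    return total

# Rows taken from the pattern.csv example above.
rows = [
    ("allgather", 24, "float32", 8, 768),
    ("allreduce", 2, "float32", 8, 8),
    ("broadcast", 3, "int64", 8, 24),
    ("reducescatter", 5571584, "float16", 8, 89145344),
]
for routine, n, dt, nranks, total in rows:
    assert expected_total_bytes(routine, n, dt, nranks) == total
print("ok")
```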

When executed, the measured time (time) and bandwidths (algbw, busbw), along with the ideal busbw and the efficiency, are appended as additional columns to the above CSV and written to standard output.

routine, num_samples, datatype, nranks, ALGO, PROTOCOL, TOTALBYTES, TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s), in-place ideal busbw(GB/s), in-place busbw efficiency(%)
allgather, 1, float32, 8, Default, Default, 32, 10.800199, 167.48905181884766, 0.00019105726405693952, 0.00016717510604982207, 450.0, 3.7150023566627126e-05
allgather, 2, float32, 8, Default, Default, 64, 10.800397, 96.36878967285156, 0.0006641154280059377, 0.0005811009995051955, 450.0, 0.000129133555445599
allgather, 24, float32, 8, Default, Default, 768, 10.804766, 96.59528732299805, 0.007950698437615698, 0.0069568611329137355, 450.0, 0.0015459691406474968
allgather, 5571584, float16, 8, Default, Default, 89145344, 248.872696, 310.7905387878418, 286.8341628020222, 250.97989245176944, 450.0, 55.773309433726546
allgather, 8389632, float16, 8, Default, Default, 134234112, 358.46347, 428.73620986938477, 313.0925471419703, 273.95597874922396, 450.0, 60.87910638871643
allgather, 11141120, float16, 8, Default, Default, 178257920, 465.465759, 538.5160446166992, 331.0169154326294, 289.6398010035507, 450.0, 64.36440022301127
allgather, 25755648, float16, 8, Default, Default, 412090368, 1033.808472, 1149.8689651489258, 358.38028548464035, 313.5827497990603, 450.0, 69.68505551090229
allgather, 27853824, float16, 8, Default, Default, 445661184, 1115.404175, 1209.5451354980469, 368.453537549463, 322.39684535578016, 450.0, 71.6437434123956
allreduce, 1, uint8, 8, Default, Default, 1, 15.000012, 109.36260223388672, 9.143893612382822e-06, 1.6001813821669937e-05, 450.0, 3.555958627037764e-06
allreduce, 1, float32, 8, Default, Default, 4, 15.00005, 107.80096054077148, 3.710542076744444e-05, 6.493448634302777e-05, 450.0, 1.4429885854006171e-05
allreduce, 1, float64, 8, Default, Default, 8, 15.000099, 105.14259338378906, 7.608714739229025e-05, 0.00013315250793650792, 450.0, 2.958944620811287e-05
allreduce, 2, float32, 8, Default, Default, 8, 15.000099, 103.61671447753906, 7.720762080073632e-05, 0.00013511333640128857, 450.0, 3.0025185866953015e-05
broadcast, 3, int64, 8, Default, Default, 24, 7.20017, 106.84728622436523, 0.00022461964967086912, 0.00022461964967086912, 450.0, 4.991547770463758e-05
reducescatter, 5571584, float16, 8, Default, Default, 89145344, 248.872696, 315.05823135375977, 282.9487857433698, 247.58018752544857, 450.0, 55.01781945009968
reducescatter, 8389632, float16, 8, Default, Default, 134234112, 358.46347, 421.07105255126953, 318.79206890779005, 278.9430602943163, 450.0, 61.98734673207029
reducescatter, 11141120, float16, 8, Default, Default, 178257920, 465.465759, 517.7497863769531, 344.29356552204825, 301.25686983179224, 450.0, 66.94597107373161
reducescatter, 25755648, float16, 8, Default, Default, 412090368, 1033.808472, 1130.080223083496, 364.65585325932443, 319.0738716019089, 450.0, 70.90530480042419
reducescatter, 27853824, float16, 8, Default, Default, 445661184, 1115.404175, 1192.5220489501953, 373.71316060137076, 326.9990155261994, 450.0, 72.66644789471098

Note that the algorithm (ALGO) and protocol (PROTOCOL) are overwritten to Default here. Due to NCCL constraints, the algorithm and protocol cannot be controlled from outside on a per-call basis within a single process, so when run as-is the benchmark cannot honor the methods specified in the CSV file.

If necessary, you can force the method used by the entire process by explicitly setting the NCCL_ALGO and NCCL_PROTO environment variables, as shown below.

$ mpirun -x NCCL_ALGO=Ring -x NCCL_PROTO=Simple python -m aibooster.observability.nccl_benchmark_torch.run --pattern pattern.csv

routine, num_samples, datatype, nranks, ALGO, PROTOCOL, TOTALBYTES, TIME_FOR_EACH_NCCL_COMM_CALL_MICROSEC, in-place time(us), in-place algbw(GB/s), in-place busbw(GB/s), in-place ideal busbw(GB/s), in-place busbw efficiency(%)
allgather, 1, float32, 8, Ring, Simple, 32, 10.800199, 172.5316047668457, 0.00018547326469978582, 0.0001622891066123126, 450.0, 3.606424591384725e-05
allgather, 2, float32, 8, Ring, Simple, 64, 10.800397, 96.2376594543457, 0.000665020329493373, 0.0005818927883067014, 450.0, 0.0001293095085126003
allgather, 24, float32, 8, Ring, Simple, 768, 10.804766, 96.7264175415039, 0.007939919822528962, 0.006947429844712841, 450.0, 0.0015438732988250759
allgather, 5571584, float16, 8, Ring, Simple, 89145344, 248.872696, 308.69245529174805, 288.7836825028585, 252.68572219000117, 450.0, 56.15238270888915
allgather, 8389632, float16, 8, Ring, Simple, 134234112, 358.46347, 424.4208335876465, 316.27597275400836, 276.7414761597573, 450.0, 61.49810581327941
allgather, 11141120, float16, 8, Ring, Simple, 178257920, 465.465759, 537.2524261474609, 331.79546768779625, 290.3210342268217, 450.0, 64.51578538373816
allgather, 25755648, float16, 8, Ring, Simple, 412090368, 1033.808472, 1153.3379554748535, 357.30235534504175, 312.63956092691154, 450.0, 69.47545798375812
allgather, 27853824, float16, 8, Ring, Simple, 445661184, 1115.404175, 1207.9715728759766, 368.93350307818577, 322.81681519341254, 450.0, 71.73707004298056
allreduce, 1, uint8, 8, Ring, Simple, 1, 15.000012, 110.00633239746094, 9.090385782401388e-06, 1.5908175119202428e-05, 450.0, 3.535150026489428e-06
allreduce, 1, float32, 8, Ring, Simple, 4, 15.00005, 106.7042350769043, 3.7486797005921126e-05, 6.560189476036198e-05, 450.0, 1.4578198835635995e-05
allreduce, 1, float64, 8, Ring, Simple, 8, 15.000099, 107.69367218017578, 7.428477307947753e-05, 0.00012999835288908567, 450.0, 2.8888522864241258e-05
allreduce, 2, float32, 8, Ring, Simple, 8, 15.000099, 105.70287704467773, 7.568384346453141e-05, 0.00013244672606292998, 450.0, 2.9432605791762215e-05
broadcast, 3, int64, 8, Ring, Simple, 24, 7.20017, 106.88304901123047, 0.00022454449252732545, 0.00022454449252732545, 450.0, 4.9898776117183436e-05
reducescatter, 5571584, float16, 8, Ring, Simple, 89145344, 248.872696, 319.0875053405762, 279.3758530433564, 244.45387141293685, 450.0, 54.32308253620819
reducescatter, 8389632, float16, 8, Ring, Simple, 134234112, 358.46347, 423.8605499267578, 316.69404482959163, 277.10728922589266, 450.0, 61.579397605753925
reducescatter, 11141120, float16, 8, Ring, Simple, 178257920, 465.465759, 520.8730697631836, 342.22909639203556, 299.4504593430311, 450.0, 66.54454652067359
reducescatter, 25755648, float16, 8, Ring, Simple, 412090368, 1033.808472, 1131.4988136291504, 364.1986743921262, 318.6738400931104, 450.0, 70.81640890958009
reducescatter, 27853824, float16, 8, Ring, Simple, 445661184, 1115.404175, 1190.030574798584, 374.49557468338946, 327.6836278479658, 450.0, 72.81858396621462