NCCL Ideal Communication Bandwidth Estimation Method
The PyTorch NCCL benchmark results report the ideal communication bandwidth together with the efficiency of the measured communication bandwidth relative to that ideal. This document gives an overview of how the ideal communication bandwidth is estimated.
Notation
- $B_{\text{GPU}}$: Unidirectional communication bandwidth per GPU for communication paths connecting GPUs
- $B_{\text{node}}$: Unidirectional communication bandwidth per node for communication paths connecting nodes
- $S$: Size of data for collective communication (bytes)
- $n$: Total number of GPUs for benchmarking
- $G$: Number of GPUs per node
- $N$: Number of nodes
- $t$: Measured execution time of collective communication
Basic Concept
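Bus bandwidth (BusBW) in NCCL benchmarks is defined by the following formula, where $D$ is the total amount of data to be transferred and $t$ is the measured execution time of the collective communication:

$$\text{BusBW} = \frac{D}{t}$$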
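For how to calculate the total amount of data to be transferred, refer to NVIDIA's NCCL Tests documentation. For example, if an AllReduce of 1 GB of data takes 0.1 seconds on 4 nodes with 8 GPUs per node (32 GPUs in total), the AllReduce correction factor $2(n-1)/n$ from that documentation, where $n$ is the total number of GPUs, gives:

$$\text{BusBW} = \frac{1\,\text{GB}}{0.1\,\text{s}} \times \frac{2\,(32-1)}{32} = 19.375\,\text{GB/s}$$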
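The calculation above can be sketched in Python. This is a minimal illustration, not the benchmark's own code: the function names `busbw_factor` and `bus_bandwidth` are hypothetical, the correction factors are the ones listed in NVIDIA's NCCL Tests documentation, and 1 GB is assumed to mean 10^9 bytes.

```python
def busbw_factor(collective: str, n_gpus: int) -> float:
    """Correction factor applied to size/time, per the NCCL Tests documentation."""
    factors = {
        "all_reduce": 2 * (n_gpus - 1) / n_gpus,
        "all_gather": (n_gpus - 1) / n_gpus,
        "reduce_scatter": (n_gpus - 1) / n_gpus,
        "broadcast": 1.0,
        "reduce": 1.0,
    }
    return factors[collective]


def bus_bandwidth(size_bytes: float, time_s: float, n_gpus: int,
                  collective: str = "all_reduce") -> float:
    """Bus bandwidth in bytes per second: data to be transferred / measured time."""
    return size_bytes / time_s * busbw_factor(collective, n_gpus)


# Example from the text: AllReduce of 1 GB in 0.1 s on 4 nodes x 8 GPUs.
example_busbw = bus_bandwidth(1e9, 0.1, n_gpus=4 * 8)
print(f"{example_busbw / 1e9:.3f} GB/s")
```

Keeping the correction factor in its own function makes it easy to check each collective against the NCCL Tests documentation separately from the timing data.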
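The ideal communication bandwidth is estimated by the same formula, with the measured execution time replaced by the ideal communication time $T_{\text{ideal}}$, which is determined by the program execution environment and the type of collective communication. With $D$ again denoting the total amount of data to be transferred:

$$\text{BW}_{\text{ideal}} = \frac{D}{T_{\text{ideal}}}$$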
The ideal communication bandwidth is determined primarily by the benchmark execution environment (communication hardware and execution process mapping) and, in general, does not depend on runtime parameters such as communication type, data type, or data size. However, the network features that are available may change with the communication type or data type, and in such cases the ideal communication bandwidth may vary with runtime parameters even in the same execution environment.
Estimation Method for Single Node Execution
When running benchmarks on a single node, the ideal communication bandwidth is estimated under the following assumptions:
- Each GPU can simultaneously send to and receive from other GPUs in the node at the per-GPU unidirectional bandwidth $B_{\text{GPU}}$
- Each GPU in the node is connected to the same network with full bisection bandwidth
- The network to which each GPU is connected only performs communication (does not perform in-network computation)
- Time spent on anything other than communication (e.g., the reduction computation in AllReduce) is negligible
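Under these assumptions, every GPU can send and receive data at the per-GPU unidirectional bandwidth $B_{\text{GPU}}$ at all times, so the ideal communication time for a total of $D$ bytes to be transferred is:

$$T_{\text{ideal}} = \frac{D}{B_{\text{GPU}}}$$

Substituting this into the ideal communication bandwidth estimation formula from the previous section gives the ideal communication bandwidth for single node execution:

$$\text{BW}_{\text{ideal}} = \frac{D}{T_{\text{ideal}}} = \frac{D}{D / B_{\text{GPU}}} = B_{\text{GPU}}$$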
For example, when benchmarks run on a single node whose GPUs can each send and receive simultaneously at a given unidirectional bandwidth over a full bisection bandwidth network, the ideal communication bandwidth is estimated to equal that per-GPU bandwidth, regardless of the number of GPUs.
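The single node estimate, and the efficiency figure reported next to it, can be sketched as follows. This is an illustration under the assumptions listed above, not the benchmark's implementation. The function names are hypothetical, and the 100 GB/s per-GPU bandwidth and 80 GB/s measured bandwidth are made-up values chosen only to show the arithmetic.

```python
def ideal_busbw_single_node(total_data_bytes: float, gpu_bw: float) -> float:
    """Ideal bandwidth = data to be transferred / ideal communication time."""
    ideal_time_s = total_data_bytes / gpu_bw  # every GPU transfers at gpu_bw
    return total_data_bytes / ideal_time_s


def efficiency(measured_busbw: float, ideal_busbw: float) -> float:
    """Fraction of the ideal bandwidth achieved by the measured bandwidth."""
    return measured_busbw / ideal_busbw


HYPOTHETICAL_GPU_BW = 100e9        # bytes/s, illustrative value only
HYPOTHETICAL_MEASURED_BW = 80e9    # bytes/s, illustrative value only

# The estimate does not change with the GPU count or the data volume,
# because the data volume cancels out of the ratio.
ideal_by_gpu_count = {}
for n_gpus in (2, 4, 8):
    all_reduce_data = 1e9 * 2 * (n_gpus - 1) / n_gpus  # AllReduce of 1e9 bytes
    ideal_by_gpu_count[n_gpus] = ideal_busbw_single_node(
        all_reduce_data, HYPOTHETICAL_GPU_BW
    )

example_efficiency = efficiency(HYPOTHETICAL_MEASURED_BW, HYPOTHETICAL_GPU_BW)
print(ideal_by_gpu_count, example_efficiency)
```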
Estimation Method for Multi-Node Execution
When running benchmarks on multiple nodes, the ideal communication bandwidth is estimated under the following assumptions:
- Single node assumptions are satisfied
- Each node can simultaneously send to and receive from other nodes at the per-node unidirectional bandwidth $B_{\text{node}}$
- Each node is connected to an inter-node network with full bisection bandwidth
- The network to which each node is connected only performs communication (does not perform in-network computation)
- Intra-node GPU communication and inter-node communication are independent and do not interfere with each other's communication performance
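Under these assumptions, inter-node and intra-node communication proceed independently, so the longer of the two is taken as the execution time of the entire collective communication. The ideal communication time is therefore:

$$T_{\text{ideal}} = \max\left(T_{\text{inter}},\ T_{\text{intra}}\right)$$

The ideal inter-node and intra-node communication times are each the total amount of data to be transferred over that path ($D_{\text{inter}}$ between nodes, $D_{\text{intra}}$ within a node) divided by the corresponding bandwidth (the per-node bandwidth $B_{\text{node}}$ and the per-GPU bandwidth $B_{\text{GPU}}$):

$$T_{\text{inter}} = \frac{D_{\text{inter}}}{B_{\text{node}}}, \qquad T_{\text{intra}} = \frac{D_{\text{intra}}}{B_{\text{GPU}}}$$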
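The "longer of the two times" rule can be sketched in Python. This is a minimal illustration with hypothetical function names. It takes the inter-node and intra-node data volumes as inputs rather than deriving them, and every number in the example is made up purely to show the arithmetic.

```python
def ideal_time_multi_node(inter_bytes: float, intra_bytes: float,
                          node_bw: float, gpu_bw: float) -> float:
    """Ideal time is the slower of the independent inter- and intra-node transfers."""
    t_inter = inter_bytes / node_bw  # data crossing nodes / per-node bandwidth
    t_intra = intra_bytes / gpu_bw   # data staying in a node / per-GPU bandwidth
    return max(t_inter, t_intra)


def ideal_busbw_multi_node(total_bytes: float, inter_bytes: float,
                           intra_bytes: float, node_bw: float,
                           gpu_bw: float) -> float:
    """Ideal bandwidth = total data to be transferred / ideal communication time."""
    return total_bytes / ideal_time_multi_node(
        inter_bytes, intra_bytes, node_bw, gpu_bw
    )


# Illustrative values only: 4e9 bytes cross nodes at 50e9 bytes/s and
# 1e9 bytes stay inside a node at 100e9 bytes/s.
example_time = ideal_time_multi_node(4e9, 1e9, node_bw=50e9, gpu_bw=100e9)
print(example_time)  # the inter-node transfer is the bottleneck here
```

Passing the two data volumes in explicitly keeps the sketch independent of how they are computed for a particular collective and process mapping.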
Meanwhile, the total amounts of data to be transferred between nodes and within nodes are:
Substituting these formulas into the ideal communication bandwidth estimation formula and simplifying yields:
For example, when executing collective communication in an environment where GPU connections are , inter-node connections are , GPUs per node , and number of nodes , the ideal communication bandwidth is estimated as .