Version: Next

NCCL Ideal Communication Bandwidth Estimation Method

The PyTorch NCCL benchmark results report the ideal communication bandwidth and the efficiency of the measured communication bandwidth relative to that ideal. This document gives an overview of how the ideal communication bandwidth is estimated.

Notation

- $`B`$ : Unidirectional communication bandwidth per GPU for communication paths connecting GPUs
- $`I`$ : Unidirectional communication bandwidth per node for communication paths connecting nodes
- $`S`$ : Size of data for collective communication (bytes)
- $`N`$ : Total number of GPUs used for benchmarking
  - $`P`$ : Number of GPUs per node
  - $`Q`$ : Number of nodes
  - $`N = P \times Q`$
- $`T`$ : Measured execution time of collective communication

Basic Concept

Bus bandwidth (BusBW) in NCCL benchmarks is defined by the following formula:

\textrm{BusBW} = \frac{\textrm{Total amount of data to be transferred}}{T \times N}

For how to calculate the total amount of data to be transferred, refer to NVIDIA's NCCL Tests documentation. For example, if an AllReduce of 1 GB of data takes 0.1 seconds on 4 nodes with 8 GPUs per node ($`N = 32`$), BusBW is calculated as follows:

\textrm{BusBW} = \frac{S \times 2 \times (N - 1)}{T \times N} = \frac{1 \textrm{[GB]} \times 2 \times (32 - 1)}{0.1 \textrm{[s]} \times 32} = 19.375 \textrm{[GB/s]}
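The AllReduce calculation above can be reproduced with a few lines of Python. This is a minimal sketch rather than the benchmark's own code: the function name `allreduce_bus_bandwidth` and its parameters are hypothetical, and the result is expressed in whatever unit `size` uses, per second.

```python
def allreduce_bus_bandwidth(size: float, time_s: float, num_gpus: int) -> float:
    """Return AllReduce BusBW = S * 2 * (N - 1) / (T * N).

    The result is in (units of `size`) per second.
    """
    if num_gpus < 1 or time_s <= 0:
        raise ValueError("num_gpus must be >= 1 and time_s must be > 0")
    return size * 2 * (num_gpus - 1) / (time_s * num_gpus)


# 1 GB AllReduce in 0.1 s on 4 nodes x 8 GPUs per node (N = 32).
bus_bw = allreduce_bus_bandwidth(size=1.0, time_s=0.1, num_gpus=4 * 8)
print(f"{bus_bw:.3f} GB/s")  # prints 19.375 GB/s
```

Other collectives use a different expression for the total amount of data, so only the numerator would change.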

The ideal communication bandwidth is estimated by the following formula, replacing the measured execution time $`T`$ with the ideal communication time $`T_\textrm{Ideal}`$ determined by the program execution environment and the type of collective communication:

\textrm{Ideal communication bandwidth} = \frac{\textrm{Total amount of data to be transferred}}{T_\textrm{Ideal} \times N}

The ideal communication bandwidth is determined primarily by the benchmark execution environment (communication hardware and process mapping) and generally does not depend on runtime parameters such as communication type, data type, or data size. However, the network features that are available may change with the communication type or data type; in such cases, the ideal communication bandwidth may vary with runtime parameters even in the same execution environment.
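Because BusBW and the ideal communication bandwidth share the same numerator and the same $`N`$, their ratio reduces to $`T_\textrm{Ideal} / T`$. The sketch below assumes that efficiency is reported as this ratio; the function name `bandwidth_efficiency` and the input values are hypothetical and purely illustrative.

```python
def bandwidth_efficiency(measured_bus_bw: float, ideal_bw: float) -> float:
    """Return measured BusBW as a fraction of the ideal bandwidth.

    Assumed definition: efficiency = BusBW / ideal bandwidth = T_Ideal / T.
    """
    if ideal_bw <= 0:
        raise ValueError("ideal_bw must be positive")
    return measured_bus_bw / ideal_bw


# Illustrative values only: 150 GB/s measured against a 187.5 GB/s ideal.
efficiency = bandwidth_efficiency(measured_bus_bw=150.0, ideal_bw=187.5)
print(f"{efficiency:.1%}")  # prints 80.0%
```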

Estimation Method for Single Node Execution

When running benchmarks on a single node, the ideal communication bandwidth is estimated under the following assumptions:

- Each GPU can simultaneously send to and receive from other GPUs in the node at speed $`B`$
- Each GPU in the node is connected to the same network with full bisection bandwidth
- The network to which each GPU is connected only performs communication (does not perform in-network computation)
- Time spent on anything other than communication (e.g., the reduction computation in AllReduce) is negligible

Under these assumptions, since all $`N`$ GPUs can always send and receive data at speed $`B`$ , the ideal communication time $`T_\textrm{Ideal}`$ is expressed by:

T_\textrm{Ideal} = \frac{\textrm{Total amount of data to be transferred}}{B \times N}

Substituting this into the ideal communication bandwidth formula from the previous section gives the following estimate of the ideal communication bandwidth for single node execution:

\textrm{Ideal communication bandwidth} = B

For example, when running benchmarks on a single node in an environment where GPUs capable of simultaneous send/receive at $`B = 450 \textrm{[GB/s]}`$ are connected to a full bisection bandwidth network, the ideal communication bandwidth is estimated to be $`450 \textrm{[GB/s]}`$ regardless of the number of GPUs.
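The substitution can also be checked numerically. The following sketch evaluates the two formulas step by step for several GPU counts and confirms that the result is always $`B`$. The helper names are hypothetical, and the AllReduce expression $`S \times 2 \times (N - 1)`$ is used only as an example of the total amount of data.

```python
def ideal_time_single_node(total_data: float, b: float, num_gpus: int) -> float:
    """T_Ideal = total data / (B * N): all N GPUs transfer at speed B."""
    return total_data / (b * num_gpus)


def ideal_bandwidth(total_data: float, ideal_time: float, num_gpus: int) -> float:
    """Ideal bandwidth = total data / (T_Ideal * N)."""
    return total_data / (ideal_time * num_gpus)


B = 450.0  # GB/s per GPU, as in the example above
results = {}
for num_gpus in (2, 4, 8):
    total_data = 1.0 * 2 * (num_gpus - 1)  # AllReduce traffic for S = 1 GB
    t_ideal = ideal_time_single_node(total_data, B, num_gpus)
    results[num_gpus] = ideal_bandwidth(total_data, t_ideal, num_gpus)

print(results)  # every value equals B, up to floating-point rounding
```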

Estimation Method for Multi-Node Execution

When running benchmarks on multiple nodes, the ideal communication bandwidth is estimated under the following assumptions:

- The single-node assumptions are satisfied
- Each node can simultaneously send to and receive from other nodes at speed $`I`$
- Each node is connected to an inter-node network with full bisection bandwidth
- The network to which each node is connected only performs communication (does not perform in-network computation)
- Intra-node GPU communication and inter-node communication are independent and do not interfere with each other's communication performance

Under these assumptions, inter-node and intra-node communication proceed independently and can overlap, so the longer of the two times is taken as the execution time of the entire collective communication. Therefore, the ideal communication time $`T_\textrm{Ideal}`$ is expressed by:

T_\textrm{Ideal} = \max(\textrm{Ideal communication time for inter-node}, \textrm{Ideal communication time for intra-node})

Each of these ideal communication times is the corresponding amount of data to be transferred divided by the aggregate bandwidth available for it: $`I \times Q`$ across the $`Q`$ nodes for inter-node communication, and $`B \times N`$ across all $`N`$ GPUs for intra-node communication, as in the single node case:

\textrm{Ideal communication time for inter-node} = \frac{\textrm{Total data to be transferred between nodes}}{I \times Q}, \\ \textrm{Ideal communication time for intra-node} = \frac{\textrm{Total data to be transferred within nodes}}{B \times N}

The total amounts of data to be transferred between nodes and within nodes are given by:

\textrm{Total data to be transferred between nodes} = \textrm{Total data to be transferred in collective communication} \times \frac{Q - 1}{N - 1}, \\ \textrm{Total data to be transferred within nodes} = \textrm{Total data to be transferred in collective communication} \times \frac{N - Q}{N - 1}
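The split can be sketched in Python as follows. The function name `split_transfer` is hypothetical; it applies the two fractions above, which always sum to 1, so the two parts add up to the total amount of data.

```python
from typing import Tuple


def split_transfer(total_data: float, gpus_per_node: int, num_nodes: int) -> Tuple[float, float]:
    """Split the total transfer into (inter-node, intra-node) amounts.

    Uses the fractions (Q - 1) / (N - 1) and (N - Q) / (N - 1) with N = P * Q.
    """
    n = gpus_per_node * num_nodes
    if n < 2:
        raise ValueError("at least two GPUs are required")
    inter = total_data * (num_nodes - 1) / (n - 1)
    intra = total_data * (n - num_nodes) / (n - 1)
    return inter, intra


# AllReduce of S = 1 GB on 2 nodes x 8 GPUs: total = 1 * 2 * (16 - 1) = 30 GB.
inter, intra = split_transfer(total_data=30.0, gpus_per_node=8, num_nodes=2)
print(inter, intra)  # prints 2.0 28.0
```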

Substituting these formulas into the ideal communication bandwidth estimation formula and simplifying yields:

\textrm{Ideal communication bandwidth} = \min\left(\frac{I \times (N - 1) \times Q}{N \times (Q - 1)}, \frac{B \times (N - 1)}{N - Q} \right)

For example, for collective communication in an environment with intra-node GPU bandwidth $`B = 450 \textrm{[GB/s]}`$, inter-node bandwidth $`I = 100 \textrm{[GB/s]}`$, $`P = 8`$ GPUs per node, and $`Q = 2`$ nodes ($`N = 16`$), the ideal communication bandwidth is estimated as $`\min(187.5, 482.1) \textrm{[GB/s]} = 187.5 \textrm{[GB/s]}`$.
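The closed-form estimate is straightforward to evaluate in code. The sketch below is illustrative only: the function name `ideal_bandwidth_multi_node` is hypothetical, and treating a term as unbounded when its traffic is zero ($`Q = 1`$ or $`P = 1`$) is an assumption made here to avoid division by zero. With $`Q = 1`$ the result reduces to $`B`$, matching the single node estimate.

```python
import math


def ideal_bandwidth_multi_node(b: float, i: float, gpus_per_node: int, num_nodes: int) -> float:
    """Return min(I * (N - 1) * Q / (N * (Q - 1)), B * (N - 1) / (N - Q))."""
    n = gpus_per_node * num_nodes
    if n < 2:
        raise ValueError("at least two GPUs are required")
    # Assumption: a path that carries no traffic imposes no limit.
    inter_term = math.inf if num_nodes == 1 else i * (n - 1) * num_nodes / (n * (num_nodes - 1))
    intra_term = math.inf if gpus_per_node == 1 else b * (n - 1) / (n - num_nodes)
    return min(inter_term, intra_term)


# B = 450 GB/s, I = 100 GB/s, P = 8, Q = 2 (N = 16), as in the example above.
ideal = ideal_bandwidth_multi_node(b=450.0, i=100.0, gpus_per_node=8, num_nodes=2)
print(f"{ideal:.1f} GB/s")  # prints 187.5 GB/s
```

In this configuration the inter-node term is the smaller one, so the inter-node network is the limiting factor.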
