Version: v2602

NCCL Ideal Communication Bandwidth Estimation Method

The PyTorch NCCL benchmark results display the ideal communication bandwidth and how efficiently the measured bus bandwidth (BusBW) compares against it. This document gives an overview of the method used to estimate the ideal communication bandwidth.

Notation

  • B: Unidirectional communication bandwidth per GPU for communication paths connecting GPUs
  • I: Unidirectional communication bandwidth per node for communication paths connecting nodes
  • S: Size of data for collective communication (bytes)
  • N: Total number of GPUs for benchmarking
    • P: Number of GPUs per node
    • Q: Number of nodes
    • N = P × Q
  • T: Measured execution time of collective communication
  • D: Overall data volume to be transferred in collective communication (bytes)
  • D_Inter: Data volume transferred by inter-node communication in the collective communication (bytes)
  • D_Intra: Data volume transferred by intra-node communication in the collective communication (bytes)

Basic Concept

BusBW in NCCL benchmarks is defined by the following formula:

\textrm{BusBW} = \frac{D}{T \times N}

For how to calculate D, please refer to NVIDIA's NCCL Tests documentation. For example, if it takes 0.1 seconds to perform AllReduce communication of 1 GB of data using 2 nodes with 8 GPUs each, BusBW is calculated as follows:

\textrm{BusBW} = \frac{D}{T \times N} = \frac{S \times 2 \times (N - 1)}{T \times N} = \frac{1\ \textrm{[GB]} \times 2 \times (16 - 1)}{0.1\ \textrm{[s]} \times 16} = 18.75\ \textrm{[GB/s]}

The BusBW calculated by this formula can be interpreted as the average bandwidth per GPU, obtained by dividing the total bandwidth of the entire multi-GPU system (calculated as D / T) by N.
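As a concrete check, the AllReduce example above can be reproduced in a few lines of Python. This is a sketch using the document's notation (S, T, N); the function name is ours, and the AllReduce data-volume factor 2 × (N − 1) follows the NCCL Tests definition of D referenced above.

```python
def allreduce_busbw(size_gb: float, time_s: float, num_gpus: int) -> float:
    """BusBW = D / (T * N), with D = S * 2 * (N - 1) for AllReduce."""
    d = size_gb * 2 * (num_gpus - 1)  # total data volume moved (GB)
    return d / (time_s * num_gpus)    # average bandwidth per GPU (GB/s)

# 1 GB AllReduce over 16 GPUs taking 0.1 s:
print(allreduce_busbw(1.0, 0.1, 16))  # 18.75
```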

Let T_Ideal denote the ideal communication time, defined as the completion time of a collective communication operation under optimal conditions. The theoretical upper bound on BusBW (representing the ideal communication bandwidth) is then obtained by replacing T with T_Ideal in the BusBW equation.

\textrm{Ideal communication bandwidth} = \frac{D}{T_\textrm{Ideal} \times N}

Like the BusBW formula, this equation can be interpreted as computing the upper bound on the total system bandwidth (D / T_Ideal) and dividing it by N to obtain the upper bound on the average bandwidth per GPU.

The value of T_Ideal depends on the benchmark environment, such as whether the run spans one node or several. The following sections detail how AIBooster estimates T_Ideal and the ideal communication bandwidth.

Estimation Method for Single Node Execution

When running benchmarks on a single node, the ideal communication bandwidth is estimated under the following assumptions:

  • Each GPU can simultaneously send to and receive from other GPUs in the node at speed B
  • Each GPU in the node is connected to the same network with full bisection bandwidth
  • The network to which each GPU is connected only performs communication (does not perform in-network computation)
  • Time spent on non-communication work (e.g., the reduction computation in AllReduce) is negligibly short

Under these assumptions, since all N GPUs can always send and receive data at speed B, the ideal communication time T_Ideal is expressed by:

T_\textrm{Ideal} = \frac{D}{B \times N}

Substituting this into the ideal communication bandwidth estimation formula shown in the previous section yields the following evaluation of ideal communication bandwidth for single node execution:

\textrm{Ideal communication bandwidth} = B

For example, when running benchmarks on a single node in an environment where GPUs capable of simultaneous send/receive at B = 450 [GB/s] are connected to a full bisection bandwidth network, the ideal communication bandwidth is estimated to be 450 [GB/s] regardless of the number of GPUs.
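A minimal sketch of this result under the single-node assumptions above: whatever the values of D and N, the per-GPU ideal bandwidth collapses to B (the function name and example values are ours).

```python
def single_node_ideal_busbw(d_gb: float, b_gbps: float, num_gpus: int) -> float:
    """Ideal BusBW on one node: D / (T_ideal * N) with T_ideal = D / (B * N)."""
    t_ideal = d_gb / (b_gbps * num_gpus)  # ideal completion time (s)
    return d_gb / (t_ideal * num_gpus)    # simplifies to b_gbps

# Independent of data size and GPU count:
print(single_node_ideal_busbw(30.0, 450.0, 8))   # 450.0
print(single_node_ideal_busbw(1.0, 450.0, 16))   # 450.0
```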

Estimation Method for Multi-Node Execution

When running benchmarks on multiple nodes, the ideal communication bandwidth is estimated under the following assumptions:

  • Single node assumptions are satisfied
  • Each node can simultaneously send to and receive from other nodes at speed I
  • Each node is connected to an inter-node network with full bisection bandwidth
  • The network to which each node is connected only performs communication (does not perform in-network computation)
  • Intra-node GPU communication and inter-node communication do not interfere with each other's communication performance
  • The shorter communication time (intra-node GPU communication or inter-node communication) is hidden by the longer communication time

Under these assumptions, the longer communication time, whether intra-node or inter-node, determines the execution time of the entire collective communication. Therefore, the ideal communication time T_Ideal is expressed by:

T_\textrm{Ideal} = \max(\textrm{Ideal communication time for inter-node}, \textrm{Ideal communication time for intra-node})

The ideal communication times for inter-node and intra-node are the total amount of data to be transferred in each kind of communication divided by the aggregate inter-node bandwidth (I × Q) and the aggregate intra-node bandwidth (B × N), respectively:

\textrm{Ideal communication time for inter-node} = \frac{D_\textrm{Inter}}{I \times Q}, \\ \textrm{Ideal communication time for intra-node} = \frac{D_\textrm{Intra}}{B \times N}

Next, we consider the proportions of D_Inter and D_Intra within D. Collective communications handled by NCCL consist of operations where each GPU exchanges data with the N - 1 other GPUs. For example, in AllGather, each GPU needs to send its data to the remaining N - 1 GPUs, while in Broadcast, the root GPU needs to send its data to the other N - 1 GPUs. Conversely, in Reduce, the root GPU needs to receive data from the other N - 1 GPUs. Furthermore, since AllReduce combines reduction and broadcast operations, it requires two such sets of operations.

We now count the inter-node and intra-node communications required when exchanging data with the N - 1 other GPUs. Here we consider a GPU sending data to the other N - 1 GPUs, but the same reasoning applies to receiving. First, data can be transferred via intra-node communication only to GPUs belonging to the same node as the sending GPU. For GPUs on other nodes, data can be sent via inter-node communication to one GPU per node, and the remaining GPUs on those nodes can then receive the data through intra-node communication from GPUs that already hold it. Therefore, the minimum number of inter-node communications required is Q - 1, and the remaining (N - 1) - (Q - 1) = N - Q communications can be handled via intra-node communication. Although it is possible to always use inter-node communication for GPUs on different nodes, we assume here that the minimal number of inter-node communications is carried out. The proportions of D_Inter and D_Intra within D are therefore:

D_\textrm{Inter} = D \times \frac{Q - 1}{N - 1}, \\ D_\textrm{Intra} = D \times \frac{N - Q}{N - 1}
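The split derived above can be sketched as a small helper (a sketch assuming the minimal Q − 1 inter-node transfers per set of N − 1 exchanges; the function name is ours).

```python
def split_data_volume(d: float, num_gpus: int, num_nodes: int):
    """Split total volume D into (D_Inter, D_Intra) per the (Q-1)/(N-1) ratio."""
    d_inter = d * (num_nodes - 1) / (num_gpus - 1)
    d_intra = d * (num_gpus - num_nodes) / (num_gpus - 1)
    return d_inter, d_intra

# 16 GPUs on 2 nodes: only 1/15 of the traffic must cross the inter-node network.
print(split_data_volume(30.0, 16, 2))  # (2.0, 28.0)
```

Note that the two parts always sum back to D, since (Q − 1) + (N − Q) = N − 1.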

Substituting these formulas into the formula for T_Ideal yields:

T_\textrm{Ideal} = \max\left( \frac{D_\textrm{Inter}}{I \times Q}, \frac{D_\textrm{Intra}}{B \times N} \right) \\ = \max\left( \frac{D \times (Q - 1) / (N - 1)}{I \times Q}, \frac{D \times (N - Q) / (N - 1)}{B \times N} \right)

Substituting this into the ideal communication bandwidth estimation formula shown in the previous section yields the following evaluation of ideal communication bandwidth for multi-node execution:

\textrm{Ideal communication bandwidth} = \frac{D}{\max\left( \displaystyle \frac{D \times (Q - 1) / (N - 1)}{I \times Q}, \frac{D \times (N - Q) / (N - 1)}{B \times N} \right) \times N} \\ = \min\left( \frac{I \times (N - 1) \times Q}{N \times (Q - 1)}, \frac{B \times (N - 1)}{N - Q} \right)

For example, when executing collective communication in an environment where the GPU interconnect bandwidth is B = 450 [GB/s], the inter-node bandwidth is I = 100 [GB/s], GPUs per node P = 8, and number of nodes Q = 2, the ideal communication bandwidth is estimated as min(187.5, 482.1) [GB/s] = 187.5 [GB/s].