
Case Study

Learn how to observe and analyze various metrics on AIBooster and connect them to performance improvements, using a case study of Llama4 Scout continued pre-training.

Environment and Configuration

Training Environment

  • High-Power PHY Sec.A x 4 nodes
    • GPU: NVIDIA H100 80GB x 8
    • Interconnect: 200Gbps x 4
    • Ubuntu 22.04 LTS
  • Initial setup with AIBooster completed
    • Performance observation dashboard/profiling tools
    • Performance improvement framework
    • Recommended infrastructure settings applied
    • Model development kit

Training Configuration and Assumptions

  • Training library: LLaMA-Factory
  • Dataset: RedPajama-V1 ArXiv Subset (28B Token Count)
  • Uses the configuration values from the official LLaMA-Factory sample (Llama3 full-parameter SFT), changing only the model
  • After placing the code, model, and data, run the following command on each node:
FORCE_TORCHRUN=1 \
NNODES=4 \
NODE_RANK=<Node number from 0 to 3> \
MASTER_ADDR=<Address of node 0> \
MASTER_PORT=29500 \
llamafactory-cli train examples/train_full/llama4_full_pt.yaml
  • Training of 3 epochs completed in approximately 28 hours
Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 100892.0068, 'train_samples_per_second': 1.199, 'train_steps_per_second': 0.019, 'train_loss': 1.72523823162866, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1890/1890 [28:01:32<00:00, 53.38s/it]
[INFO|trainer.py:3984] 2025-04-28 08:27:36,306 >> Saving model checkpoint to saves/llama4-109b/full/pt

..snip..

[INFO|tokenization_utils_base.py:2519] 2025-04-28 08:32:00,528 >> Special tokens file saved in saves/llama4-109b/full/pt/special_tokens_map.json
***** train metrics *****
epoch = 2.996
total_flos = 1557481GF
train_loss = 1.7252
train_runtime = 1 day, 4:01:32.00
train_samples_per_second = 1.199
train_steps_per_second = 0.019
Figure saved at: saves/llama4-109b/full/pt/training_loss.png
[WARNING|2025-04-28 08:32:00] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[INFO|modelcard.py:450] 2025-04-28 08:32:00,937 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
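
As a quick sanity check, the per-iteration time and step count reported in the log are consistent with the roughly 28-hour wall-clock time (a back-of-the-envelope check using only numbers taken from the log above):

# Consistency check of the training log numbers above.
steps = 1890          # total optimizer steps shown by the progress bar
sec_per_step = 53.38  # average seconds per iteration from the progress bar

total_seconds = steps * sec_per_step                        # ~100,888 s
print(f"estimated runtime: {total_seconds / 3600:.1f} h")   # ~28.0 h
# This matches the logged train_runtime of 100,892 s (1 day, 4:01:32).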

Analysis of Collected Metrics

Results from observing time-series metrics in the AIBooster performance observation dashboard:

insight-1

  • Average GPU Utilization: 99.1% (looks high at first glance)
  • Average GPU SM Activity: 21.3% (actual GPU core utilization is low, staying flat at around 20%; see the note after these lists)
  • GPU Utilization, GPU SM Activity: the GPU stalls at regular intervals
  • Network Send/Recv Bandwidth: average interconnect bandwidth usage is about 6 GB/s for both Send and Recv (theoretical bandwidth is 25 GB/s)

insight-2

  • Storage Write Bandwidth: Storage writes occur in sync with GPU stalls
  • Storage Read Bandwidth: Intermittent storage reads occur
  • CPU Utilization: CPU has capacity to spare
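
The gap between the two GPU metrics above is worth dwelling on: the GPU Utilization counter only reports whether any kernel was running during the sampling interval, so it can sit near 100% even while most SMs are idle, whereas GPU SM Activity reflects how busy the compute units actually are. Independently of AIBooster, the coarse utilization counter can be sampled directly via NVML; the sketch below uses the pynvml package (assumed to be installed) and covers only that counter, since SM activity comes from profiling metrics such as DCGM's DCGM_FI_PROF_SM_ACTIVE field.

# Minimal sketch: sampling the coarse GPU utilization counter via NVML.
# Requires the pynvml package and an NVIDIA driver on the node.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 on this node

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu is the percentage of time at least one kernel was active;
    # it says nothing about how many SMs that kernel actually kept busy.
    print(f"gpu_util={util.gpu}%  mem_util={util.memory}%")
    time.sleep(1)

pynvml.nvmlShutdown()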

Flame Graph Analysis

flame graph

In the profile panel, clicking the Flame Graph button displays an enlarged flame graph. Notice the pt_autograd_#N processes in the third row from the top: these PyTorch autograd engine threads account for most of the processing time. Below each pt_autograd_#N, the work is split into two parts. First, let's examine the processing on the left. Click the block outlined in white in the figure below, then click Focus Block in the tab that appears to inspect the details.

flame graph 2

As the figure below shows, execution is blocked inside DeepSpeed's fetch_sub_module function. From this code, we can infer that it is waiting for parameters partitioned across the nodes to arrive.

flame graph 3

Next, let's examine the processing on the right, below pt_autograd_#N. Click the block outlined in white in the figure below and click Focus Block to inspect the details.

flame graph 4

Here, execution is blocked inside DeepSpeed ZeRO3's _reduce_and_partition_ipg_grads function. From this code, we can infer that it is waiting for the aggregation of model gradients to complete.

flame graph 5

Analysis Summary

From the above metrics and flame graph analysis, we found:

  • GPU is not being utilized to its full potential
  • There are GPU stalls that appear to correspond to checkpoint writes, but their overall impact is small
  • Combining the flame graph with the code, training is most likely blocking on the aggregation of data distributed across the nodes
  • Meanwhile, there's spare capacity in interconnect bandwidth

Based on these observations, we can hypothesize that communication efficiency in distributed training is the bottleneck. Specifically, in DeepSpeed ZeRO3's parameter distribution and aggregation processing, communication wait time exceeds computation time.
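
To make the hypothesis concrete: under ZeRO3, each module's parameters are all-gathered from the ranks that own their shards just before they are needed (the wait observed in fetch_sub_module), and gradients are reduce-scattered back to their owning ranks after the backward pass (the wait observed in _reduce_and_partition_ipg_grads). The sketch below reproduces only this communication pattern with plain torch.distributed collectives; it is an illustration of where the waits occur, not DeepSpeed's actual code path, and the tensor size is made up.

# Illustration of the two ZeRO3 communication phases seen in the flame graph,
# using plain torch.distributed collectives (not DeepSpeed's implementation).
# Launch with torchrun so the process-group environment variables are set.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
torch.cuda.set_device(device)

shard_numel = 1_000_000  # hypothetical per-rank parameter shard size

# Phase 1: before compute, gather the full parameter from the per-rank shards.
# Conceptually, this is what blocks inside fetch_sub_module while shards arrive.
local_shard = torch.randn(shard_numel, device=device)
full_param = torch.empty(shard_numel * world, device=device)
dist.all_gather_into_tensor(full_param, local_shard)

# ... forward/backward compute on the gathered parameter would happen here ...

# Phase 2: after backward, reduce-scatter the gradients so each rank keeps only
# its own partition. This is the wait seen in _reduce_and_partition_ipg_grads.
grad = torch.randn_like(full_param)
grad_shard = torch.empty(shard_numel, device=device)
dist.reduce_scatter_tensor(grad_shard, grad, op=dist.ReduceOp.AVG)

dist.destroy_process_group()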

Improvement Measure: Optimize DeepSpeed Hyperparameters

AIBooster includes tools for automatically adjusting various performance parameters of DeepSpeed.

We search for the optimal values of the DeepSpeed ZeRO3 configuration parameter set that minimize execution time (a conceptual sketch of this kind of search follows).
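
Conceptually, the search treats the ZeRO3 communication parameters as a tuning space and measures seconds per iteration for each candidate set. The sketch below illustrates the idea with Optuna; it is not AIBooster's actual tooling, and run_short_benchmark is a hypothetical helper that would launch a few training steps with the given settings and return the measured seconds per iteration.

# Conceptual sketch of searching ZeRO3 parameters to minimize seconds/iteration.
# Not AIBooster's actual tool; run_short_benchmark is a hypothetical stand-in.
import optuna

def run_short_benchmark(zero3_config: dict) -> float:
    # Hypothetical helper: run a few training steps with this ZeRO3 config and
    # return the measured seconds per iteration. Replace with a real runner.
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    zero3 = {
        "stage": 3,
        "overlap_comm": trial.suggest_categorical("overlap_comm", [True, False]),
        "contiguous_gradients": trial.suggest_categorical("contiguous_gradients", [True, False]),
        "sub_group_size": trial.suggest_int("sub_group_size", 1_000_000, 1_000_000_000, log=True),
        "reduce_bucket_size": trial.suggest_int("reduce_bucket_size", 1, 100_000_000, log=True),
        "stage3_prefetch_bucket_size": trial.suggest_int("stage3_prefetch_bucket_size", 100_000, 100_000_000, log=True),
        "stage3_param_persistence_threshold": trial.suggest_int("stage3_param_persistence_threshold", 10_000, 10_000_000, log=True),
        "stage3_max_live_parameters": trial.suggest_int("stage3_max_live_parameters", 100_000_000, 10_000_000_000, log=True),
        "stage3_max_reuse_distance": trial.suggest_int("stage3_max_reuse_distance", 100_000_000, 10_000_000_000, log=True),
    }
    return run_short_benchmark(zero3)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=40)  # roughly the number of trials used here
print(study.best_params)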

For detailed usage, please see the Performance Improvement Guide.

PI

Optimization Results

Parameter Changes:

Parameter Name                       | Before Optimization | After Optimization
overlap_comm                         | false               | false
contiguous_gradients                 | true                | false
sub_group_size                       | 1,000,000,000       | 14,714,186
reduce_bucket_size                   | 26,214,400          | 1
stage3_prefetch_bucket_size          | 23,592,960          | 473,451
stage3_param_persistence_threshold   | 51,200              | 4,304,746
stage3_max_live_parameters           | 1,000,000,000       | 6,914,199,685
stage3_max_reuse_distance            | 1,000,000,000       | 3,283,215,516
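
For reference, the tuned values above would slot into the zero_optimization section of the DeepSpeed configuration that the training run passes to DeepSpeed. The sketch below shows only that section as a Python dict; the remaining DeepSpeed and LLaMA-Factory settings are omitted and may differ from the actual file used in this run.

# The post-optimization ZeRO3 values from the table above, expressed as the
# zero_optimization section of a DeepSpeed config (a sketch; other fields omitted).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": False,
        "contiguous_gradients": False,
        "sub_group_size": 14_714_186,
        "reduce_bucket_size": 1,
        "stage3_prefetch_bucket_size": 473_451,
        "stage3_param_persistence_threshold": 4_304_746,
        "stage3_max_live_parameters": 6_914_199_685,
        "stage3_max_reuse_distance": 3_283_215_516,
    },
}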

Performance Improvement Results:

insight-after

  • Processing time: 50[s/itr] → 40[s/itr] (approximately 20% improvement)
  • Communication bandwidth: 6.0 → 7.5[GiB/s] (approximately 25% improvement)
  • Average GPU SM Activity: 21% → 28% (an increase of about 7 percentage points)
  • No degradation observed in loss curve

loss function

The optimal parameters obtained from approximately 40 trials significantly improved training performance.

Conclusion

By analyzing dashboard metrics and flame graphs, you can identify what to focus on for hyperparameter tuning (observation). Then, using PI to search for optimal parameter set values, you can achieve performance improvements (improvement). In performance engineering, this loop of observation and improvement is crucial.

cycle

What becomes a bottleneck varies by environment and application. Start by simply observing performance with the dashboard.