
Case Study

Learn how to observe and analyze various metrics on AIBooster and connect them to performance improvements, using a case study of Llama4 Scout continued pre-training.

Environment and Configuration

Training Environment

  • High-Power PHY Sec.A x 4 nodes
    • GPU: NVIDIA H100 80GB x 8
    • Interconnect: 200Gbps x 4
    • Ubuntu 22.04 LTS
  • Initial setup with AIBooster completed
    • Performance observation dashboard/profiling tools
    • Performance improvement framework
    • Recommended infrastructure settings applied
    • Model development kit

Training Configuration and Assumptions

  • Training library: LLaMA-Factory
  • Dataset: RedPajama-V1 ArXiv Subset (28B Token Count)
  • Uses the configuration values from the official LLaMA-Factory sample (Llama3 full-parameter SFT), changing only the model
  • After placing the code, model, and data, run the following command on each node:
FORCE_TORCHRUN=1 \
NNODES=4 \
NODE_RANK=<Node number from 0 to 3> \
MASTER_ADDR=<Address of node 0> \
MASTER_PORT=29500 \
llamafactory-cli train examples/train_full/llama4_full_pt.yaml
  • Training of 3 epochs completed in approximately 28 hours
Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 100892.0068, 'train_samples_per_second': 1.199, 'train_steps_per_second': 0.019, 'train_loss': 1.72523823162866, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1890/1890 [28:01:32<00:00, 53.38s/it]
[INFO|trainer.py:3984] 2025-04-28 08:27:36,306 >> Saving model checkpoint to saves/llama4-109b/full/pt

..snip..

[INFO|tokenization_utils_base.py:2519] 2025-04-28 08:32:00,528 >> Special tokens file saved in saves/llama4-109b/full/pt/special_tokens_map.json
***** train metrics *****
epoch = 2.996
total_flos = 1557481GF
train_loss = 1.7252
train_runtime = 1 day, 4:01:32.00
train_samples_per_second = 1.199
train_steps_per_second = 0.019
Figure saved at: saves/llama4-109b/full/pt/training_loss.png
[WARNING|2025-04-28 08:32:00] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[INFO|modelcard.py:450] 2025-04-28 08:32:00,937 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
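
As a quick sanity check, the per-iteration time and step count reported in the log are consistent with the roughly 28-hour wall-clock time (a back-of-the-envelope check using only numbers taken from the log above):

# Consistency check of the training log numbers above.
steps = 1890          # total optimizer steps shown by the progress bar
sec_per_step = 53.38  # average seconds per iteration from the progress bar

total_seconds = steps * sec_per_step                        # ~100,888 s
print(f"estimated runtime: {total_seconds / 3600:.1f} h")   # ~28.0 h
# This matches the logged train_runtime of 100,892 s (1 day, 4:01:32).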

Analysis of Collected Metrics

Results from observing time-series metrics in the AIBooster performance observation dashboard:

insight-1

  • Average GPU Utilization: 99.1% (looks high at first glance)
  • Average GPU SM Activity: 21.3% (actual GPU core utilization is low, staying flat at around 20%; see the note after these lists)
  • GPU Utilization, GPU SM Activity: the GPU stalls at regular intervals
  • Network Send/Recv Bandwidth: average interconnect bandwidth usage is about 6 GB/s for both Send and Recv (theoretical bandwidth is 25 GB/s)

insight-2

  • Storage Write Bandwidth: Storage writes occur in sync with GPU stalls
  • Storage Read Bandwidth: Intermittent storage reads occur
  • CPU Utilization: CPU has capacity to spare
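
The gap between the two GPU metrics above is worth dwelling on: the GPU Utilization counter only reports whether any kernel was running during the sampling interval, so it can sit near 100% even while most SMs are idle, whereas GPU SM Activity reflects how busy the compute units actually are. Independently of AIBooster, the coarse utilization counter can be sampled directly via NVML; the sketch below uses the pynvml package (assumed to be installed) and covers only that counter, since SM activity comes from profiling metrics such as DCGM's DCGM_FI_PROF_SM_ACTIVE field.

# Minimal sketch: sampling the coarse GPU utilization counter via NVML.
# Requires the pynvml package and an NVIDIA driver on the node.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 on this node

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu is the percentage of time at least one kernel was active;
    # it says nothing about how many SMs that kernel actually kept busy.
    print(f"gpu_util={util.gpu}%  mem_util={util.memory}%")
    time.sleep(1)

pynvml.nvmlShutdown()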

Flame Graph Analysis

flame graph

In the profile panel, clicking the Flame Graph button displays an enlarged flame graph. Notice the pt_autograd_#N processes in the third row from the top: these PyTorch autograd engine threads account for most of the processing time. Below each pt_autograd_#N, the work is split into two parts. First, let's examine the processing on the left. Click the block outlined in white in the figure below, then click Focus Block in the tab that appears to inspect the details.

flame graph 2

As the figure below shows, execution is blocked inside DeepSpeed's fetch_sub_module function. From this code, we can infer that it is waiting for parameters partitioned across the nodes to arrive.

flame graph 3

Next, let's examine the processing on the right, below pt_autograd_#N. Click the block outlined in white in the figure below and click Focus Block to inspect the details.

flame graph 4

Here, execution is blocked inside DeepSpeed ZeRO3's _reduce_and_partition_ipg_grads function. From this code, we can infer that it is waiting for the aggregation of model gradients to complete.

flame graph 5

Analysis Summary

From the above metrics and flame graph analysis, we found:

  • GPU is not being utilized to its full potential
  • There are GPU stalls that appear to correspond to checkpoint writes, but their overall impact is small
  • Combining the flame graph with the code, training is most likely blocking on the aggregation of data distributed across the nodes
  • Meanwhile, there's spare capacity in interconnect bandwidth

Based on these observations, we can hypothesize that communication efficiency in distributed training is the bottleneck. Specifically, in DeepSpeed ZeRO3's parameter distribution and aggregation processing, communication wait time exceeds computation time.
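
To make the hypothesis concrete: under ZeRO3, each module's parameters are all-gathered from the ranks that own their shards just before they are needed (the wait observed in fetch_sub_module), and gradients are reduce-scattered back to their owning ranks after the backward pass (the wait observed in _reduce_and_partition_ipg_grads). The sketch below reproduces only this communication pattern with plain torch.distributed collectives; it is an illustration of where the waits occur, not DeepSpeed's actual code path, and the tensor size is made up.

# Illustration of the two ZeRO3 communication phases seen in the flame graph,
# using plain torch.distributed collectives (not DeepSpeed's implementation).
# Launch with torchrun so the process-group environment variables are set.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
torch.cuda.set_device(device)

shard_numel = 1_000_000  # hypothetical per-rank parameter shard size

# Phase 1: before compute, gather the full parameter from the per-rank shards.
# Conceptually, this is what blocks inside fetch_sub_module while shards arrive.
local_shard = torch.randn(shard_numel, device=device)
full_param = torch.empty(shard_numel * world, device=device)
dist.all_gather_into_tensor(full_param, local_shard)

# ... forward/backward compute on the gathered parameter would happen here ...

# Phase 2: after backward, reduce-scatter the gradients so each rank keeps only
# its own partition. This is the wait seen in _reduce_and_partition_ipg_grads.
grad = torch.randn_like(full_param)
grad_shard = torch.empty(shard_numel, device=device)
dist.reduce_scatter_tensor(grad_shard, grad, op=dist.ReduceOp.AVG)

dist.destroy_process_group()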

Improvement Measure: Optimize DeepSpeed Hyperparameters

AIBooster includes tools for automatically adjusting various performance parameters of DeepSpeed.

We search for the optimal values of the DeepSpeed ZeRO3 configuration parameter set that minimize execution time (a conceptual sketch of this kind of search follows).
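
Conceptually, the search treats the ZeRO3 communication parameters as a tuning space and measures seconds per iteration for each candidate set. The sketch below illustrates the idea with Optuna; it is not AIBooster's actual tooling, and run_short_benchmark is a hypothetical helper that would launch a few training steps with the given settings and return the measured seconds per iteration.

# Conceptual sketch of searching ZeRO3 parameters to minimize seconds/iteration.
# Not AIBooster's actual tool; run_short_benchmark is a hypothetical stand-in.
import optuna

def run_short_benchmark(zero3_config: dict) -> float:
    # Hypothetical helper: run a few training steps with this ZeRO3 config and
    # return the measured seconds per iteration. Replace with a real runner.
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    zero3 = {
        "stage": 3,
        "overlap_comm": trial.suggest_categorical("overlap_comm", [True, False]),
        "contiguous_gradients": trial.suggest_categorical("contiguous_gradients", [True, False]),
        "sub_group_size": trial.suggest_int("sub_group_size", 1_000_000, 1_000_000_000, log=True),
        "reduce_bucket_size": trial.suggest_int("reduce_bucket_size", 1, 100_000_000, log=True),
        "stage3_prefetch_bucket_size": trial.suggest_int("stage3_prefetch_bucket_size", 100_000, 100_000_000, log=True),
        "stage3_param_persistence_threshold": trial.suggest_int("stage3_param_persistence_threshold", 10_000, 10_000_000, log=True),
        "stage3_max_live_parameters": trial.suggest_int("stage3_max_live_parameters", 100_000_000, 10_000_000_000, log=True),
        "stage3_max_reuse_distance": trial.suggest_int("stage3_max_reuse_distance", 100_000_000, 10_000_000_000, log=True),
    }
    return run_short_benchmark(zero3)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=40)  # roughly the number of trials used here
print(study.best_params)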

For detailed usage, please see the Performance Improvement Guide.

PI

Optimization Results

Parameter Changes:

Parameter Name                       | Before Optimization | After Optimization
overlap_comm                         | false               | false
contiguous_gradients                 | true                | false
sub_group_size                       | 1,000,000,000       | 14,714,186
reduce_bucket_size                   | 26,214,400          | 1
stage3_prefetch_bucket_size          | 23,592,960          | 473,451
stage3_param_persistence_threshold   | 51,200              | 4,304,746
stage3_max_live_parameters           | 1,000,000,000       | 6,914,199,685
stage3_max_reuse_distance            | 1,000,000,000       | 3,283,215,516
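
For reference, the tuned values above would slot into the zero_optimization section of the DeepSpeed configuration that the training run passes to DeepSpeed. The sketch below shows only that section as a Python dict; the remaining DeepSpeed and LLaMA-Factory settings are omitted and may differ from the actual file used in this run.

# The post-optimization ZeRO3 values from the table above, expressed as the
# zero_optimization section of a DeepSpeed config (a sketch; other fields omitted).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": False,
        "contiguous_gradients": False,
        "sub_group_size": 14_714_186,
        "reduce_bucket_size": 1,
        "stage3_prefetch_bucket_size": 473_451,
        "stage3_param_persistence_threshold": 4_304_746,
        "stage3_max_live_parameters": 6_914_199_685,
        "stage3_max_reuse_distance": 3_283_215_516,
    },
}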

Performance Improvement Results:

insight-after

  • Processing time: 50[s/itr] → 40[s/itr] (approximately 20% improvement)
  • Communication bandwidth: 6.0 → 7.5[GiB/s] (approximately 25% improvement)
  • Average GPU SM Activity: 21% → 28% (an increase of about 7 percentage points)
  • No degradation observed in loss curve

loss function

The optimal parameters obtained from approximately 40 trials significantly improved training performance.

Conclusion

By analyzing dashboard metrics and flame graphs, you can identify what to focus on for hyperparameter tuning (observation). Then, using PI to search for optimal parameter set values, you can achieve performance improvements (improvement). In performance engineering, this loop of observation and improvement is crucial.

cycle

What becomes a bottleneck varies by environment and application. Start by simply observing performance with the dashboard.