Version: v2511

Changelog

v2511

New Features

  • Observation (PO)
    • Revamped slurm-job-tracer, pytorchjob-tracer, and process-tracer with new job information format
    • Database version management and automatic migration functionality
    • Grafana library panel support on Kubernetes
  • Improvement (PI)
    • Job routing functionality for autonomous tuning
    • AcuiRT end-to-end workflow (conversion, accuracy evaluation, inference speed evaluation, profile acquisition, report output)

Improvements

  • Preemption support and automatic namespace configuration for pytorchjob-tracer
  • Monthly release workflow for Python packages
  • Improved hostname resolution for helm-based faib-agent in Kubernetes environments
  • Added 10-second polling to the Portal frontend
  • Stricter AcuiRT conversion eligibility checks
  • Added GKE environment setup guide
  • Document tree structure changes and build process improvements
  • Removed dynolog-agent container dependency
  • Other minor bug fixes and performance improvements

v2510

New Features

This release does not include any new features.

Improvements

  • Automatic recovery when some GPU metrics cannot be collected
  • Enhanced logging information for PyTorchJob Tracer
  • Countermeasures against ECR rate limiting
  • SaaS functionality UI/UX improvements
  • Improved installer error messages
  • Other minor bug fixes

v2509

New Features

  • Launched the SaaS version
  • Observation (PO)
    • Redesigned UI for GPU detailed profiling functionality
    • Added a Metrics API for programmatic metrics retrieval
  • Improvement (PI)
    • Began distributing packages on PyPI
    • Inference optimization for PyTorch models with AcuiRT (automatic model conversion for TensorRT)
    • Autonomous optimization with ZenithTune (automatic tuning of jobs with free parameter definitions)

Improvements

  • Enhanced ClickHouse security features with TLS verification and HTTP-only mode support
  • Strengthened authentication for Agent-Server communication
  • Added systemd unit file generation to improve service management
  • Fixed panel display issues
  • Other minor bug fixes and performance improvements

v2508

New Features

  • Added general-purpose job tuning functionality in Kubernetes environments (PyTorchJobTuner)
  • Added support for Slurm Array Jobs

Improvements

  • Improved handling of additional Slurm job states, such as REQUEUED
  • Added automatic HOSTNAME configuration at Agent startup
  • Stabilized pcm-exporter in environments that do not support Intel PCM
  • Removed unnecessary port exposure on Agent nodes to improve security
  • Fixed log transmission failures in slurm-job-tracer
  • Other minor bug fixes and performance improvements

v2507

New Features

This release does not include any new features.

Improvements

  • Improved Grafana dashboard and AIBooster Profile Analyzer Plugin (new detailed trace display, improved process monitoring, etc.)
  • Changed the ClickHouse connection user from the default user to one configured through an environment variable
  • Improved span termination when PyTorchJob is deleted
  • Improved Dynolog database connection handling and added retry functionality
  • Fixed ClickHouse-related errors
  • Other minor bug fixes and performance improvements

v2506

New Features

  • Observation (PO)
    • Tracking of process groups that match specific conditions as jobs
    • Real-time tracking of Kubernetes PyTorchJobs
    • Job monitoring and GPU resource tracking for Slurm environments
    • Detailed GPU profiling through PyTorch tracing
    • Agent support for collecting Lustre filesystem metrics
  • Improvement (PI)
    • Agent support for tracing hyperparameter tuning results
    • Automatic tuning and CPU affinity optimization for MMEngine and DeepSpeed

Improvements

  • NCCL benchmark support for H200 GPUs
  • Infrastructure setup support for H200 environments
  • Added automatic restart configuration for node-exporter
  • Fixed multiple issues related to Slurm deployment
  • Other minor bug fixes and performance improvements