Changelog

v2510

New Features

This release does not include any new features.

Improvements

  • Automatic recovery when some GPU metrics cannot be collected
  • Enhanced logging information for PyTorchJob Tracer
  • Countermeasures against ECR rate limiting
  • UI/UX improvements for SaaS functionality
  • Improved installer error messages
  • Other minor bug fixes

v2509

New Features

  • Launched the SaaS version
  • Observation (PO)
    • Redesigned UI for GPU detailed profiling functionality
    • Added Metrics API functionality, supporting programmatic metrics retrieval
  • Improvement (PI)
    • Began distributing packages on PyPI
    • Inference optimization for PyTorch models with AcuiRT (automatic model conversion for TensorRT)
    • Autonomous optimization with ZenithTune (automatic tuning of jobs with free parameter definitions)

Improvements

  • Enhanced ClickHouse security features with TLS verification and HTTP-only mode support
  • Strengthened authentication for Agent-Server communication
  • Added systemd unit file generation functionality to improve service management
  • Fixed panel display issues
  • Other minor bug fixes and performance improvements

v2508

New Features

  • Added general-purpose job tuning functionality in Kubernetes environments (PyTorchJobTuner)
  • Added support for Slurm Array Jobs

Improvements

  • Improved handling of additional Slurm job states, such as REQUEUED
  • Added automatic HOSTNAME configuration at Agent startup
  • Stabilized pcm-exporter in environments that do not support Intel PCM
  • Removed unnecessary port exposure on Agent nodes to improve security
  • Fixed log transmission failures in slurm-job-tracer
  • Other minor bug fixes and performance improvements

v2507

New Features

This release does not include any new features.

Improvements

  • Improved Grafana dashboard and AIBooster Profile Analyzer Plugin (new detailed trace display, improved process monitoring, etc.)
  • Changed the ClickHouse connection user from the default user to one configured through environment variables
  • Improved span termination when PyTorchJob is deleted
  • Improved Dynolog database connection handling and added retry functionality
  • Fixed ClickHouse-related errors
  • Other minor bug fixes and performance improvements

v2506

New Features

  • Observation (PO)
    • Functionality to track process groups matching specific conditions as jobs
    • Functionality to track Kubernetes PyTorchJobs in real-time
    • Job monitoring and GPU resource tracking functionality for Slurm environments
    • Detailed GPU profiling functionality through PyTorch tracing
    • Agent functionality to collect Lustre filesystem metrics
  • Improvement (PI)
    • Agent functionality to trace hyperparameter tuning results
    • Automatic tuning functionality and CPU Affinity optimization for MMEngine and DeepSpeed

Improvements

  • NCCL benchmark support for H200 GPUs
  • Infrastructure setup support for H200 environments
  • Added automatic restart configuration for node-exporter
  • Fixed multiple issues related to Slurm deployment
  • Other minor bug fixes and performance improvements