Changelog
v2510
New Features
This release does not include any new features.
Improvements
- Automatic recovery when some GPU metrics cannot be collected
- Enhanced logging information for PyTorchJob Tracer
- Countermeasures against ECR rate limiting
- UI/UX improvements for SaaS functionality
- Improved installer error messages
- Other minor bug fixes
v2509
New Features
- Launched the SaaS version
- Observation (PO)
  - Redesigned the UI for detailed GPU profiling
  - Added a Metrics API for programmatic metrics retrieval
- Improvement (PI)
  - Started package distribution on PyPI
  - Inference optimization for PyTorch models with AcuiRT (automatic model conversion for TensorRT)
  - Autonomous optimization with ZenithTune (automatic tuning of jobs with free parameter definitions)
Improvements
- Enhanced ClickHouse security features with TLS verification and HTTP-only mode support
- Strengthened authentication for Agent-Server communication
- Added systemd unit file generation to improve service management
- Fixed panel display issues
- Other minor bug fixes and performance improvements
v2508
New Features
- Added general-purpose job tuning for Kubernetes environments (PyTorchJobTuner)
- Added support for Slurm Array Jobs
Improvements
- Improved handling of additional Slurm job states, such as REQUEUED
- Added automatic HOSTNAME configuration at Agent startup
- Stabilized pcm-exporter in environments that do not support Intel PCM
- Removed unnecessary port exposure on Agent nodes to improve security
- Fixed log transmission failures in slurm-job-tracer
- Other minor bug fixes and performance improvements
v2507
New Features
This release does not include any new features.
Improvements
- Improved Grafana dashboard and AIBooster Profile Analyzer Plugin (new detailed trace display, improved process monitoring, etc.)
- Changed the ClickHouse connection user from the default to one set through an environment variable
- Improved span termination when PyTorchJob is deleted
- Improved Dynolog database connection handling and added retry functionality
- Fixed ClickHouse-related errors
- Other minor bug fixes and performance improvements
v2506
New Features
- Observation (PO)
  - Tracking of process groups that match specific conditions as jobs
  - Real-time tracking of Kubernetes PyTorchJobs
  - Job monitoring and GPU resource tracking for Slurm environments
  - Detailed GPU profiling through PyTorch tracing
  - Agent support for collecting Lustre filesystem metrics
- Improvement (PI)
  - Agent support for tracing hyperparameter tuning results
  - Automatic tuning and CPU affinity optimization for MMEngine and DeepSpeed
Improvements
- NCCL benchmark support for H200 GPUs
- Infrastructure setup support for H200 environments
- Added automatic restart configuration for node-exporter
- Fixed multiple issues related to Slurm deployment
- Other minor bug fixes and performance improvements