Troubleshooting
This section explains common problems during AIBooster setup and operation, and their solutions.
Data Not Displayed
GPU Compatibility Check
AIBooster collects GPU metrics through NVIDIA DCGM, and the set of supported metrics depends on your GPU.
- Full support (recommended): A100, H100, H200
  - All metrics are available.
- Partial support: GeForce GPUs (RTX/GTX series)
  - Only basic metrics are supported. Some metrics, such as SM Activity, cannot be acquired.
If GPU metrics are not displayed, or if some metrics (such as SM Activity) cannot be acquired, check whether your GPU is compatible with DCGM. For details, refer to the official NVIDIA DCGM documentation or contact your representative.
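As a quick first check, the GPU model reported by the driver can be compared against the support tiers listed above. The following is a minimal sketch and not part of AIBooster: `classify_gpu` is a hypothetical helper, and the name patterns are assumptions based on how `nvidia-smi` typically reports product names, so treat the result as a hint rather than a definitive compatibility verdict.

```shell
# classify_gpu NAME
# Hypothetical helper: maps a GPU product name to the support tier
# described in this section ("full", "partial", or "unknown").
classify_gpu() {
  case "$1" in
    # A100/H100/H200, not followed by another digit (so "A1000" is excluded)
    *A100|*A100[!0-9]*|*H100|*H100[!0-9]*|*H200|*H200[!0-9]*)
      echo "full" ;;
    # GeForce RTX/GTX cards are assumed to report "GeForce" in their name
    *GeForce*)
      echo "partial" ;;
    *)
      echo "unknown" ;;
  esac
}

# Classify every GPU on this node, if the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name --format=csv,noheader |
    while IFS= read -r name; do
      printf '%s: %s\n' "$name" "$(classify_gpu "$name")"
    done
fi
```

A result of `unknown` only means the model is not in the list above. If DCGM is installed on the node, `dcgmi discovery -l` lists the GPUs that DCGM itself can see, which is a more direct check.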
Firewall Restriction Removal
Restriction Removal on Server Node
The Server component needs to accept traffic on TCP ports 3000, 8123, and 16697. They are used for the following purposes:
- 3000: HTTP access to the Grafana dashboard
- 8123: Ingestion of performance observation data into the ClickHouse database
- 16697: Application server for Server-Agent communication, etc.
Port 3000 must accept traffic from users who access AIBooster's performance observation dashboard, whereas ports 8123 and 16697 must accept traffic from the compute nodes being observed. Configure your environment to allow this traffic.
Typical ways to do this include:
- Configuring SSH port forwarding
- Configuring a firewall (ufw)
- Allowing the ports in a security group
For example, with ufw, open the ports as follows:
sudo ufw allow 3000
sudo ufw allow 8123
sudo ufw allow 16697
You can also allow access only from specific IP addresses.
sudo ufw allow from 198.51.100.0 to any port 3000 proto tcp
sudo ufw allow from 198.51.100.0 to any port 8123 proto tcp
sudo ufw allow from 198.51.100.0 to any port 16697 proto tcp
In this example, access is allowed only from 198.51.100.0.
Replace it with your actual IP address.
If the Server runs on a machine that is reachable only from within a local network and is never accessed from external networks, IP restrictions are unnecessary. However, if unknown devices share the network or your environment has a strict security policy, we recommend applying IP restrictions even within the local network.
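After changing firewall rules, it can help to confirm that the Server ports are actually reachable. The sketch below assumes that bash and the coreutils `timeout` command are available; `port_open`, `check_server_ports`, and the `SERVER` variable are hypothetical names and not part of AIBooster. Run it from a compute node to test 8123 and 16697, or from a dashboard user's machine to test 3000.

```shell
# Address of the Server node; 127.0.0.1 is only a placeholder default.
SERVER="${SERVER:-127.0.0.1}"

# port_open HOST PORT
# Succeeds if a TCP connection opens within 2 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_server_ports HOST
# Prints one line for each Server port listed in this section.
check_server_ports() {
  for port in 3000 8123 16697; do
    if port_open "$1" "$port"; then
      echo "$port: reachable"
    else
      echo "$port: not reachable"
    fi
  done
}

check_server_ports "$SERVER"
```

A port reported as not reachable may be blocked by a firewall, or the corresponding service may simply not be running, so the result narrows the problem down without pinpointing it.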
Restriction Removal on Agent Node
The Agent component uses TCP ports 26690 to 26699 for internal communication. These ports are used only within the same node and are never accessed from other nodes, so they do not need to be opened in the firewall.
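To see which of these ports currently have a listener on an Agent node, the listening sockets can be filtered by port range. This is a sketch that assumes the `ss` utility (from iproute2) and `awk` are installed; `agent_listeners` is a hypothetical helper name.

```shell
# agent_listeners
# Reads `ss -ltn` style lines on stdin and prints the local address:port
# of every listener whose port falls in the Agent range 26690-26699.
agent_listeners() {
  awk '{
    n = split($4, parts, ":")   # 4th column is the local address:port
    port = parts[n] + 0         # numeric value after the last colon
    if (port >= 26690 && port <= 26699) print $4
  }'
}

# List matching listeners on this node, if ss is available.
if command -v ss >/dev/null 2>&1; then
  ss -ltn | agent_listeners
fi
```

If nothing is printed, the Agent may not be running, in which case restarting the service as described below is a reasonable next step.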
Service Restart
To restart the AIBooster services, follow the steps below.
Server Service Restart
Run the following commands in the Server directory:
cd /opt/aibooster/server
docker compose down
docker compose up -d
Agent Service Restart
The Agent runs as the systemd unit aibooster-agent.service. Restart it with:
sudo systemctl restart aibooster-agent
Changing Metric Collection Intervals
Lengthening the metric collection interval reduces Agent load and the volume of data sent to the Server.
You can change the collection interval with the following command. Replace <server_address> with the address of your Server.
curl -X POST -H "Content-Type: application/json" -d '{"scrape_interval": <collection interval (seconds, number)>}' http://<server_address>:16697/api/v1/agents/config
Example
curl -X POST -H "Content-Type: application/json" -d '{"scrape_interval": 30}' http://192.168.100.100:16697/api/v1/agents/config
To revert to the default interval, specify null:
curl -X POST -H "Content-Type: application/json" -d '{"scrape_interval": null}' http://<server_address>:16697/api/v1/agents/config
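The requests above can be wrapped in a small helper that validates the value before sending it. This is a sketch under stated assumptions: `build_payload` and `set_scrape_interval` are hypothetical names, and restricting the interval to a positive integer is an assumption, since this section only says the value is a number of seconds.

```shell
# build_payload INTERVAL
# Prints the JSON request body for the given interval.
# Accepts "null" (revert to default) or a positive integer (assumption:
# the source only states that the value is a number of seconds).
build_payload() {
  case "$1" in
    null)
      printf '{"scrape_interval": null}\n' ;;
    ''|0*|*[!0-9]*)
      echo "interval must be a positive integer or null" >&2
      return 1 ;;
    *)
      printf '{"scrape_interval": %s}\n' "$1" ;;
  esac
}

# set_scrape_interval SERVER_ADDRESS INTERVAL
# Sends the same POST request shown above, after validating the interval.
set_scrape_interval() {
  payload=$(build_payload "$2") || return 1
  curl -X POST -H "Content-Type: application/json" \
    -d "$payload" "http://$1:16697/api/v1/agents/config"
}
```

For example, `set_scrape_interval 192.168.100.100 30` sends the same request as the example above, and `set_scrape_interval 192.168.100.100 null` reverts to the default.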