Version: v2506

Troubleshooting

This guide covers common issues encountered during AIBooster setup and operation, and how to resolve them.

Data Not Displayed

GPU Compatibility Check

GPU metrics are collected through NVIDIA DCGM, so the set of available metrics depends on your GPU model.

  • Full Support (Recommended): A100, H100, H200
    • All metrics are available.
  • Partial Support: GeForce GPUs (RTX/GTX series)
    • Only basic metrics are supported. Some metrics, such as SM Activity, are unavailable.

If GPU metrics are not displayed, or if some metrics (such as SM Activity) are missing, check whether your GPU is supported by DCGM. For details, refer to the official NVIDIA DCGM documentation or contact your representative.
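
As a quick first check, the support tiers listed above can be matched against the GPU names reported by the driver. The following is a minimal sketch, not part of AIBooster: `classify_gpu` is a hypothetical helper, it only knows the models named in this guide, and it assumes `nvidia-smi` (and optionally `dcgmi`) is installed on the host rather than only inside a container.

```shell
# Map a GPU model name to the support tiers named in this guide.
# Models not mentioned here are reported as "unknown", not as unsupported.
classify_gpu() {
  case "$1" in
    *A100|*A100[!0-9]*|*H100|*H100[!0-9]*|*H200|*H200[!0-9]*) echo "full" ;;
    *GeForce*) echo "partial" ;;
    *) echo "unknown" ;;
  esac
}

# Print the tier of every GPU the NVIDIA driver can see.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name --format=csv,noheader |
    while IFS= read -r name; do
      echo "$name: $(classify_gpu "$name")"
    done
fi

# If the DCGM command-line tool is installed, list the GPUs DCGM has discovered.
if command -v dcgmi >/dev/null 2>&1; then
  dcgmi discovery -l
fi
```

A result of "unknown" only means the model is not listed in this guide; the NVIDIA DCGM documentation remains the authoritative source for support status.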

Firewall Restriction Removal

Restriction Removal on Server Node

Server components must accept traffic on TCP ports 3000 and 9000, which are used as follows:

  • 3000: HTTP access to the Grafana dashboard
  • 9000: Receiving performance observation data into the ClickHouse database

Port 3000 must accept traffic from users who access the AIBooster performance observation dashboard, whereas port 9000 must accept traffic from the compute nodes being observed. Configure your environment to allow both kinds of traffic.
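
Before changing any settings, it can help to confirm which of these ports is actually unreachable. The following is a minimal sketch, assuming `nc` (netcat) or bash is available on the machine running the check; `check_port` and the `AIBOOSTER_SERVER` environment variable are hypothetical names introduced for illustration.

```shell
# Return success if a TCP connection to host:port can be opened within 3 seconds.
check_port() {
  host="$1"
  port="$2"
  if command -v nc >/dev/null 2>&1; then
    nc -z -w 3 "$host" "$port" >/dev/null 2>&1
  else
    # Fallback: bash's built-in /dev/tcp pseudo-device.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" >/dev/null 2>&1
  fi
}

# Only runs when AIBOOSTER_SERVER (hypothetical variable) holds the server address.
if [ -n "${AIBOOSTER_SERVER:-}" ]; then
  for port in 3000 9000; do
    if check_port "$AIBOOSTER_SERVER" "$port"; then
      echo "port $port: reachable"
    else
      echo "port $port: blocked or closed"
    fi
  done
fi
```

Because the two ports serve different clients, the result for port 3000 is most meaningful when the check runs on a dashboard user's machine, and the result for port 9000 when it runs on a compute node.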

Common configuration methods include:

  • Setting up SSH port forwarding
  • Configuring firewall (ufw)
  • Allowing through security groups

As an example, when using ufw, allow both ports as follows:

sudo ufw allow 3000
sudo ufw allow 9000

You can also allow access only from specific IP addresses:

sudo ufw allow from 198.51.100.0 to any port 3000 proto tcp
sudo ufw allow from 198.51.100.0 to any port 9000 proto tcp

In this example, access is allowed only from 198.51.100.0. Replace it with the actual source IP address.

If the server is a PC that is reachable only from the local network and not from external networks, IP restrictions are not strictly necessary. However, if unknown devices share the network or your security policy is strict, we recommend applying IP restrictions even within the local network.
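
For the SSH port forwarding option listed among the common configuration methods, dashboard access can be tunneled without opening port 3000 on the firewall. The following is a minimal sketch, assuming the user already has SSH access to the server node; `grafana_tunnel` is a hypothetical helper name and `user@server-host` is a placeholder.

```shell
# Forward a local port to the Grafana port (3000) on the server node.
# Usage: grafana_tunnel user@server-host [local_port]
grafana_tunnel() {
  target="$1"
  local_port="${2:-3000}"
  if [ -z "$target" ]; then
    echo "usage: grafana_tunnel user@server-host [local_port]" >&2
    return 2
  fi
  # -N: do not run a remote command.
  # -L: forward local_port to localhost:3000 as seen from the server.
  ssh -N -L "${local_port}:localhost:3000" "$target"
}
```

While `grafana_tunnel user@server-host` is running, the dashboard should be reachable at http://localhost:3000 on the user's machine. This sketch covers dashboard access only; it does not address traffic from compute nodes to port 9000.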

Restriction Removal on Agent Nodes and Single Node

Agent components communicate over TCP port 9100. If a firewall restricts this port, allow traffic on it.

As an example, when configuring with ufw:

sudo ufw allow 9100
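
If data still does not appear after the rule is added, it can help to confirm that something is actually listening on port 9100. The following is a minimal sketch, assuming `ss` (from iproute2) and `awk` are installed on the node; `has_listener` is a hypothetical helper name, not part of AIBooster.

```shell
# Read `ss -ltnH` output from stdin and succeed if a listening TCP socket
# uses the given port (the 4th column is "Local Address:Port").
has_listener() {
  awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

if command -v ss >/dev/null 2>&1; then
  if ss -ltnH | has_listener 9100; then
    echo "port 9100: listening"
  else
    echo "port 9100: nothing listening"
  fi
fi
```

If nothing is listening, the firewall is probably not the cause and the agent containers are worth checking first. `sudo ufw status` lists the rules that are currently active.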

Service Restart

To restart AIBooster services, follow these procedures:

For Single Node Configuration

For a single-node configuration, run the restart from the following directory:

cd /opt/aibooster/local
docker compose down
docker compose up -d

For Multi-Node Configuration

For a multi-node configuration, run the restart on each node from the directory that matches the component installed there:

cd /opt/aibooster/server
docker compose down
docker compose up -d

cd /opt/aibooster/agent
docker compose down
docker compose up -d

cd /opt/aibooster/local
docker compose down
docker compose up -d
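
The restart steps above repeat the same three commands per directory, so they can be wrapped in a small helper. The following is a minimal sketch, assuming Docker Compose v2 (`docker compose`) and the directory layout shown above; `restart_stack` is a hypothetical helper name.

```shell
# Restart the compose stack in the given directory and show container status.
# Example: restart_stack /opt/aibooster/server
restart_stack() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "skip: $dir not found" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  (
    cd "$dir" &&
      docker compose down &&
      docker compose up -d &&
      docker compose ps
  )
}
```

The final `docker compose ps` lists the containers and their state, which gives a quick indication of whether every service came back up. If a container is missing or keeps restarting, `docker compose logs` in the same directory shows its output.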