Version: v2506

Visualizing Metrics

The AIBooster performance dashboard uses ClickHouse, a column-oriented database, for data storage and Grafana for visualization. By combining the powerful query capabilities provided by ClickHouse with the visualization panels from Grafana, you can analyze collected telemetry signals from various perspectives.

About the Standard Dashboard

AIBooster provides a standard dashboard for visualizing various observed metrics. From the left menu, select "Dashboards" and click "AIBooster Performance Dashboard".

At the top of this dashboard, the overall workload's GPU Utilization (how much of the GPU is being used) and GPU SM Activity (how well the workload is optimized for the GPU) are scored on a scale of 0-100.

Below the scores, the panels display these metrics as time series graphs:

  • GPU Utilization
  • GPU SM Activity
  • CPU Utilization
  • Memory Bandwidth
  • L2 Cache Hit Ratio
  • L3 Cache Hit Ratio
  • Network Bandwidth
  • Storage Bandwidth

Additionally, the Profile panel shows a flame graph of the program running on the node. A flame graph is built from sampled stack traces and lets you analyze program bottlenecks at the source code level. For details on flame graphs, please refer to pages like this one.
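
As a rough illustration of what "reconstructing sampled stack trace information" means, the following Python sketch folds a handful of stack samples into per-frame counts. It is a minimal sketch of the general flame graph idea, not AIBooster's implementation; the function names `fold_stacks` and `frame_widths` and the sample stacks are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Count how often each complete call stack appears among the samples.

    Each sample is a list of function names ordered from the outermost
    caller to the frame that was running when the sample was taken.
    """
    return Counter(";".join(stack) for stack in samples)


def frame_widths(folded):
    """Return, for every stack prefix, the number of samples that include it.

    In a flame graph, this count determines how wide a frame is drawn.
    """
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += count
    return widths


# Hypothetical samples, invented for illustration only.
samples = [
    ["main", "train_step", "forward"],
    ["main", "train_step", "forward"],
    ["main", "train_step", "backward"],
    ["main", "load_batch"],
]

folded = fold_stacks(samples)
widths = frame_widths(folded)
for prefix, count in sorted(widths.items()):
    print(f"{prefix}: {count}")
```

A frame that appears in many samples gets a large count and is therefore drawn wide, which is why wide frames in the Profile panel are the first place to look for bottlenecks.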

The following Prometheus exporters are supported, and they collect various metrics in addition to those listed here.

For details, please refer to the official documentation linked.

Changing the Observed Node

In the upper left of the dashboard is a dropdown list labeled "Host:".

Here, the names of the nodes being observed by AIBooster are displayed as a list. By selecting an observed node from this list, you can switch the node being visualized.

Changing the Observation Time Range

You can change the observation time range from the dropdown list in the upper right of the dashboard.

Here, you can specify the observation time range using relative time from the current time or absolute time. For example:

  • From 24 hours ago to now
  • From 12:00 AM on 04/01 to 5:00 PM on 04/07

Additionally, you can select the observation range by dragging across a time series panel. This is an effective way to zoom in on a change in one metric and then check how the other metrics behaved over the same period.
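
Whether you choose a relative or an absolute range, the result is a pair of start and end timestamps for the dashboard. The following Python sketch shows that equivalence for the two examples above. It is only an illustration under assumptions: the helper names, the year 2025, the UTC time zone, and the reference "now" are all hypothetical and do not come from AIBooster or Grafana.

```python
from datetime import datetime, timedelta, timezone


def relative_range(hours_back, now=None):
    """Resolve "from N hours ago to now" into explicit (start, end) datetimes."""
    end = now if now is not None else datetime.now(timezone.utc)
    return end - timedelta(hours=hours_back), end


def absolute_range(start, end):
    """Validate an absolute range and return it as a (start, end) pair."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    return start, end


# Hypothetical reference time so the example is reproducible.
reference_now = datetime(2025, 4, 8, 9, 0, tzinfo=timezone.utc)

# "From 24 hours ago to now"
last_day = relative_range(24, now=reference_now)

# "From 12:00 AM on 04/01 to 5:00 PM on 04/07" (the year is an assumption)
first_week = absolute_range(
    datetime(2025, 4, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2025, 4, 7, 17, 0, tzinfo=timezone.utc),
)

print(last_day)
print(first_week)
```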

Adding Custom Panels

In addition to the standard dashboard, you can add your own metric visualization panels. As explained in Performance Details / Host, AIBooster collects metrics from multiple Prometheus exporters, and you can create custom panels using this data.

Steps to Add a Panel

  1. In Performance Overview, click "Edit" → "Add" → "Visualization" in the upper right corner of the existing dashboard
  2. Configure "Panel Title", "query name", and "Visualization" as needed
  3. Select "ClickHouse" for Data source
  4. Select "SQL Editor" for Editor Type
  5. Enter a query like the examples below and click "Run Query"
  6. Click "Save dashboard" as needed

SELECT
    toStartOfInterval(TimeUnix, INTERVAL 30 SECOND) AS time_bucket,
    avg(Value) AS Value
FROM otel_metrics_gauge
WHERE MetricName = 'metric_name'
    AND TimeUnix BETWEEN $__fromTime AND $__toTime
GROUP BY time_bucket
ORDER BY time_bucket;
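
To make the aggregation in this query concrete, the following Python sketch mimics what `toStartOfInterval(TimeUnix, INTERVAL 30 SECOND)` combined with `avg(Value)` and `GROUP BY time_bucket` does: each sample is rounded down to the start of its 30-second bucket, and the values in each bucket are averaged. ClickHouse performs this work itself when the panel runs the query; the function name `bucket_average` and the sample readings are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_average(samples, interval_seconds=30):
    """Average (timestamp, value) samples per fixed-size time bucket.

    Timestamps are rounded down to a multiple of interval_seconds since
    the Unix epoch; the result is sorted by bucket start time.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum, count]
    for timestamp, value in samples:
        epoch = int(timestamp.timestamp())
        bucket_start = epoch - epoch % interval_seconds
        totals[bucket_start][0] += value
        totals[bucket_start][1] += 1
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), total / count)
        for start, (total, count) in sorted(totals.items())
    ]


# Hypothetical gauge readings, invented for illustration only.
utc = timezone.utc
samples = [
    (datetime(2025, 1, 1, 12, 0, 5, tzinfo=utc), 100.0),
    (datetime(2025, 1, 1, 12, 0, 25, tzinfo=utc), 200.0),
    (datetime(2025, 1, 1, 12, 0, 40, tzinfo=utc), 300.0),
]

buckets = bucket_average(samples)
for bucket_start, average in buckets:
    print(bucket_start.isoformat(), average)
```

A larger interval produces fewer, smoother points; a smaller interval keeps more detail at the cost of returning more rows to the panel.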

Available metric names can be found in the official documentation of each exporter:

Example: GPU Power Consumption Panel

Visualizing GPU Power Consumption (Watts):

SELECT
    toStartOfInterval(TimeUnix, INTERVAL 30 SECOND) AS time_bucket,
    avg(Value) AS Value
FROM otel_metrics_gauge
WHERE MetricName = 'DCGM_FI_DEV_POWER_USAGE'
    AND TimeUnix BETWEEN $__fromTime AND $__toTime
GROUP BY time_bucket
ORDER BY time_bucket;
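
The power consumption query differs from the general template only in its metric name. If you create many panels, a small helper that fills in the template can reduce copy-and-paste mistakes. The following Python sketch is one way to do that; `build_panel_query` and `QUERY_TEMPLATE` are hypothetical names, and the name check assumes Prometheus-style metric names. The generated text is meant to be pasted into the SQL Editor, where Grafana expands `$__fromTime` and `$__toTime`.

```python
import re

# Same shape as the queries shown in this section; only the metric name
# and the bucket size are parameterized.
QUERY_TEMPLATE = """\
SELECT
    toStartOfInterval(TimeUnix, INTERVAL {interval} SECOND) AS time_bucket,
    avg(Value) AS Value
FROM otel_metrics_gauge
WHERE MetricName = '{metric}'
    AND TimeUnix BETWEEN $__fromTime AND $__toTime
GROUP BY time_bucket
ORDER BY time_bucket;
"""

# Assumption: metric names follow the Prometheus naming pattern.
METRIC_NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def build_panel_query(metric_name, interval_seconds=30):
    """Return the panel query text for one gauge metric."""
    if not METRIC_NAME_PATTERN.fullmatch(metric_name):
        raise ValueError(f"unexpected metric name: {metric_name!r}")
    if not isinstance(interval_seconds, int) or interval_seconds <= 0:
        raise ValueError("interval_seconds must be a positive integer")
    return QUERY_TEMPLATE.format(interval=interval_seconds, metric=metric_name)


query = build_panel_query("DCGM_FI_DEV_POWER_USAGE")
print(query)
```

Rejecting names that contain quotes or spaces keeps the generated SQL well-formed, because the metric name is inserted into a string literal.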