Skip to main content
Version: v2602

Customize Visualization and Telemetry

AIBooster PO (hereafter referred to as PO) is built on top of OSS such as OpenTelemetry and Grafana, and can be flexibly customized from telemetry data collection to dashboard visualization. This page describes how to customize PO.

Creating Grafana Dashboards

You can also build your own dashboards on Grafana. To add a custom dashboard, click "New" in the upper right corner of the dashboard list, then select "New dashboard".

add new dashboard

PO Standard Dashboards

The content of dashboards provided by PO (PO standard dashboards) may change in future updates. If you directly edit a PO standard dashboard, future updates may not be reflected. If you want to add custom panels to a PO standard dashboard, please copy the dashboard first and edit the copy rather than editing the PO standard dashboard directly.

PO Panel Library

PO standard dashboards display only representative metrics. If you want to view other detailed metrics, you can easily add them using the Panel Library with the following steps:

  1. Switch to "Edit" mode on the dashboard
  2. Click "Add"
  3. Click "Import from library"
  4. Click the desired panel from the library panel list
  5. After adding the panel, click "Save dashboard" in the upper right corner if needed

grafana-panel-1 grafana-panel-2 grafana-panel-3 grafana-panel-4

Available Panels

PO provides the following types of panels:

  • GPU-related metrics (DCGM): GPU utilization, temperature, power consumption, memory usage, profiling information, etc.
  • System metrics (Node Exporter): CPU load, memory, filesystem, network, etc.
  • Process/Application metrics: Process information, scraping statistics, etc.

For a complete list of metrics and detailed descriptions, see Metrics Details.

Adding Custom Telemetry Data

You can also collect your own telemetry data that PO does not collect by default and display it on dashboards. This requires two steps: adding telemetry data and implementing panels.

Step 1: Adding Telemetry Data

PO uses OpenTelemetry for telemetry data collection. You can add traces, metrics, and logs — the signal types supported by OpenTelemetry.

There are two main methods for adding custom telemetry data:

Each method is described below.

Notes on adding telemetry data

To avoid conflicts with PO's default telemetry data collection, please note the following when creating custom telemetry data:

  • Ensure that service names do not overlap with existing telemetry data
    • It is recommended to prefix service names with custom or an organization-specific prefix
    • In particular, the following prefixes are reserved and must not be used
      • aibooster, faib, JobTracer

Adding from an application

PO runs an OTLP receiver (gRPC) on port 26697 on each node where the Agent is installed. Any tool that conforms to the OpenTelemetry Protocol (OTLP) can send telemetry data to localhost:26697. This allows you to record telemetry data from your own applications alongside the data collected by PO.

The following example shows how to add custom metrics from Python code.

  1. Install the OpenTelemetry SDK for Python.

    pip install \
    opentelemetry-api \
    opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc
  2. Create and run the following metrics sending script.

    import time
    import random

    from opentelemetry import metrics
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

    MY_SERVICE_NAME = "custom-my-application"

    def configure_opentelemetry():
    resource = Resource(attributes={
    # Set service name (see "Notes on adding telemetry data")
    SERVICE_NAME: MY_SERVICE_NAME
    })
    exporter = OTLPMetricExporter(
    endpoint="http://localhost:26697", # Send to PO's collector
    insecure=True # Use HTTP for same-host communication
    )
    # Read and send metrics at regular intervals
    reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=30000 # Set export interval
    )
    provider = MeterProvider(resource=resource, metric_readers=[reader])
    metrics.set_meter_provider(provider)
    print("configure_opentelemetry done", flush=True)


    def main():
    configure_opentelemetry()
    # Create the required metrics
    meter = metrics.get_meter(MY_SERVICE_NAME)
    my_metric = meter.create_gauge(
    name="sdk_random",
    description="Random value by Python OpenTelemetry SDK",
    unit="1"
    )
    count = 0
    while True:
    # Set value and attributes for the metric
    # Note: actual export happens at the export interval
    # (every 30 seconds, not every 10 seconds)
    my_metric.set(random.random(), {"device_id": "0"})
    time.sleep(10)
    print(f"sending... {count}", flush=True)
    count += 1


    if __name__ == "__main__":
    main()

    Wait at least 3 minutes for enough metrics to accumulate before stopping the script. (You can also proceed with Step 2: Implementing custom panels while waiting.)

tip

The hostname where the Agent is running is collected by PO by default, so there is no need to collect it separately. Therefore, no special handling is required when running as a container application (however, port configuration is needed so that the OTLP receiver is accessible from within the container).

Adding a Prometheus-compatible exporter

You can also collect metrics from Prometheus-compatible exporters. You can use existing exporters or create your own.

The following example shows how to create a custom exporter (custom-my-exporter) that reports random values and add it to the configuration file.

  1. Install prometheus_client.

    pip install prometheus-client
  2. Create and run the following exporter script. It exposes metrics on port 8000.

    import random
    import time
    from prometheus_client import start_http_server, Gauge

    my_gauge = Gauge(
    "exporter_random",
    "A random value for testing",
    ["device_id"],
    )

    if __name__ == "__main__":
    start_http_server(8000)
    while True:
    my_gauge.labels(device_id="1").set(random.random())
    time.sleep(5)

    You can verify that metrics are available by running curl http://localhost:8000/metrics.

  3. Add the exporter to the OpenTelemetry Collector configuration file (/opt/aibooster/agent/static/agent-otel-collector.template.yaml).

    receivers:
    # ...(omitted)
    # prometheus/ is fixed; set the custom-my-exporter part to avoid duplication with others
    prometheus/custom-my-exporter:
    config:
    scrape_configs:
    # job_name is used as the service name (see "Notes on adding telemetry data")
    - job_name: 'custom-my-exporter'
    static_configs:
    - targets: ['localhost:8000'] # Match the exporter's port setting
    scrape_interval: 30s # Set the scrape interval
    # ...(omitted)


    service:
    pipelines:
    traces:
    receivers: [otlp]
    processors: [batch, resourcedetection]
    exporters: [clickhouse]
    metrics:
    receivers:
    - prometheus/custom-my-exporter # Add this line
    - otlp{{"\n"}}
    # - ...(omitted)
    note

    The template file must be updated on all nodes where the AIBooster PO Agent is installed.

    Once configured, restart the AIBooster PO Agent service to apply the settings.

    sudo systemctl restart aibooster-agent-endpoint

    Wait at least 3 minutes for enough metrics to accumulate. (You can also proceed with Step 2: Implementing custom panels while waiting.)

Step 2: Implementing custom panels

You can build panels with custom telemetry data and aggregation methods in the browser.

Select "Edit" mode -> "Add" -> "Visualization" to open the panel editor.

Here, as an example, we will create a panel that displays the two types of metrics created in Step 1: Adding telemetry data as a time series line chart.

Open the panel editor and configure it as shown in the following image. The image shows an example of displaying metrics created with the "Adding from an application" method. To display metrics created with "Adding a Prometheus-compatible exporter", modify the WHERE clause conditions to ServiceName = 'custom-my-exporter' and MetricName = 'exporter_random' respectively.

add custom metric

Set the data source to ClickHouse and verify that the panel type (upper right) and Query Type setting (around the center) match the image.

SELECT
toStartOfInterval(TimeUnix, INTERVAL 60 SECOND) AS time_bucket,
ResourceAttributes['host.name'] AS node,
avg(Value) AS result
FROM otel_metrics_gauge
WHERE
TimeUnix BETWEEN $__fromTime AND $__toTime
AND ServiceName = 'custom-my-application'
AND MetricName = 'sdk_random'
GROUP BY time_bucket, node
ORDER BY time_bucket, node
tip

If the graph is not displayed, check the following:

  • Verify that the time range setting at the top of the screen is appropriate.
  • Turn on Table View at the top of the screen to check the query results in a table. Verify that the query results are not empty.

The dashboard with two custom panels added will look like the following. If you want to adjust the panel display, you can also modify the parameters on the right side of the panel editor. Don't forget to click "Save dashboard" in the upper right corner after editing dashboards/panels.

result custom dashboard

All telemetry data collected by PO is stored in ClickHouse via OpenTelemetry. For details on how OpenTelemetry signals (telemetry data) are handled in ClickHouse, see clickhouseexporter and related documentation.

note

Please note the following when writing custom SQL:

  • When referencing custom telemetry data, always include ServiceName in the WHERE clause
  • PO updates may result in the deprecation or modification of collected telemetry data
    • If your custom panels reference PO standard telemetry data, verify that they work correctly after updates
    • Please check the release notes for changes. If you have any questions, contact customer support.