Metrics Details
Detailed list and descriptions of all metrics collected by AIBooster. The panel library provides 132 panels organized into the following categories.
A "-" in the Unit column denotes a dimensionless value.
GPU Metrics (DCGM)
Each DCGM metric has two panels: one with the exact panel name listed below, and one with a "by Node" suffix. Panels without the suffix list the metric for every GPU on every node. Panels with the suffix show, for each node, the average across all GPUs in that node.
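The "by Node" aggregation described above can be sketched as follows. This is a minimal illustration only, not AIBooster's implementation: the sample rows, the node names, and the `average_by_node` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-GPU samples: (node, GPU index, GPU utilization in %).
samples = [
    ("node-a", 0, 90.0),
    ("node-a", 1, 70.0),
    ("node-b", 0, 40.0),
    ("node-b", 1, 60.0),
]


def average_by_node(rows):
    """Group per-GPU values by node and return the mean for each node."""
    per_node = defaultdict(list)
    for node, _gpu, value in rows:
        per_node[node].append(value)
    return {node: mean(values) for node, values in per_node.items()}


by_node = average_by_node(samples)
print(by_node)
```

The per-GPU list corresponds to a panel without the suffix, and the dictionary returned by `average_by_node` corresponds to the matching "by Node" panel.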
Basic GPU Information (11 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
DCGM_FI_DEV_GPU_UTIL | GPU Utilization | GPU utilization percentage | % |
DCGM_FI_DEV_GPU_TEMP | GPU Temperature | GPU temperature | ℃ |
DCGM_FI_DEV_POWER_USAGE | GPU Power Usage | GPU power consumption | Watt |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | GPU Total Energy Consumption | Total energy consumption since boot | Joule |
DCGM_FI_DEV_FB_USED | GPU Memory Used | Framebuffer memory used | Bytes |
DCGM_FI_DEV_MEM_CLOCK | GPU Memory Clock | Memory clock frequency | MHz |
DCGM_FI_DEV_SM_CLOCK | GPU SM Clock | Streaming multiprocessor clock frequency | MHz |
DCGM_FI_DEV_MEMORY_TEMP | GPU Memory Temperature | GPU memory temperature | ℃ |
DCGM_FI_DEV_MEM_COPY_UTIL | GPU Memory Copy Utilization | GPU memory copy utilization | % |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | GPU PCIe Replay Counter | PCIe retry counter | - |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | GPU NVLink Bandwidth Total | NVLink bandwidth total | Bytes/sec |
Profiling Information (12 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
DCGM_FI_PROF_DRAM_ACTIVE | GPU DRAM Activity | DRAM utilization | % |
DCGM_FI_PROF_PCIE_RX_BYTES | GPU PCIe RX | PCIe receive bytes | Bytes/sec |
DCGM_FI_PROF_PCIE_TX_BYTES | GPU PCIe TX | PCIe transmit bytes | Bytes/sec |
DCGM_FI_PROF_SM_ACTIVE | GPU SM Activity | Streaming multiprocessor utilization | % |
DCGM_FI_PROF_SM_OCCUPANCY | GPU SM Occupancy | Streaming multiprocessor occupancy | % |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | GPU Tensor Core Activity | Tensor core utilization | % |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | GPU FP32 Pipeline Activity | FP32 pipeline utilization | % |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | GPU FP16 Pipeline Activity | FP16 pipeline utilization | % |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | GPU FP64 Pipeline Activity | FP64 pipeline utilization | % |
DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE | GPU Tensor IMMA Pipeline Activity | IMMA pipeline utilization | % |
DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE | GPU Tensor HMMA Pipeline Activity | HMMA pipeline utilization | % |
DCGM_FI_PROF_PIPE_TENSOR_DFMA_ACTIVE | GPU Tensor DFMA Pipeline Activity | DFMA pipeline utilization | % |
System Metrics (Node Exporter)
CPU & Load (7 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_load1 | Node Load 1min | 1-minute system load average | - |
node_load5 | Node Load 5min | 5-minute system load average | - |
node_load15 | Node Load 15min | 15-minute system load average | - |
node_cpu_frequency_max_hertz | Node CPU Frequency Max Hertz | Maximum CPU frequency | Hz |
node_cpu_frequency_min_hertz | Node CPU Frequency Min Hertz | Minimum CPU frequency | Hz |
node_cpu_scaling_frequency_hertz | Node CPU Scaling Frequency Hertz | Current CPU operating frequency | Hz |
node_cpu_scaling_governor | Node CPU Scaling Governor | Current CPU frequency scaling governor | - |
Memory (9 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_memory_MemTotal_bytes | Node Memory Total | Total memory capacity | Bytes |
node_memory_MemAvailable_bytes | Node Memory Available | Available memory capacity | Bytes |
node_memory_MemFree_bytes | Node Memory Free | Free memory capacity | Bytes |
node_memory_Active_bytes | Node Memory Active | Active memory usage | Bytes |
node_memory_Inactive_bytes | Node Memory Inactive | Inactive memory usage | Bytes |
node_memory_Cached_bytes | Node Memory Cached | Cache memory usage | Bytes |
node_memory_Buffers_bytes | Node Memory Buffers | Buffer memory usage | Bytes |
node_memory_SwapTotal_bytes | Node Swap Total | Total swap capacity | Bytes |
node_memory_SwapFree_bytes | Node Swap Free | Free swap capacity | Bytes |
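The memory gauges above are often combined into a single usage figure. The sketch below shows one common derivation, assuming "used" is defined as total minus available. The function names are hypothetical and the definition is an assumption, not something AIBooster prescribes.

```python
def memory_used_percent(mem_total_bytes, mem_available_bytes):
    """Percentage of memory in use, from node_memory_MemTotal_bytes
    and node_memory_MemAvailable_bytes."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 100.0 * (mem_total_bytes - mem_available_bytes) / mem_total_bytes


def swap_used_bytes(swap_total_bytes, swap_free_bytes):
    """Swap in use, from node_memory_SwapTotal_bytes and
    node_memory_SwapFree_bytes."""
    return swap_total_bytes - swap_free_bytes


GIB = 2**30
print(memory_used_percent(64 * GIB, 16 * GIB))
print(swap_used_bytes(8 * GIB, 6 * GIB))
```

MemAvailable is used rather than MemFree because it also accounts for cache and buffer memory that the kernel can reclaim.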
Filesystem (5 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_filesystem_size_bytes | Node Filesystem Size Bytes | Total filesystem capacity | Bytes |
node_filesystem_avail_bytes | Node Filesystem Avail Bytes | Available filesystem capacity | Bytes |
node_filesystem_free_bytes | Node Filesystem Free Bytes | Free filesystem capacity | Bytes |
node_filesystem_files | Node Filesystem Files | Total inode count | - |
node_filesystem_files_free | Node Filesystem Files Free | Free inode count | - |
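The difference between "available" and "free" matters when computing disk usage: free space includes blocks reserved for the root user, while available space is what unprivileged users can actually write. A minimal sketch of a `df`-style usage percentage follows. The function name is hypothetical and the formula is one common convention, not necessarily the one an AIBooster panel uses.

```python
def filesystem_used_percent(size_bytes, free_bytes, avail_bytes):
    """df-style usage: used / (used + available), as a percentage.

    size_bytes  -> node_filesystem_size_bytes
    free_bytes  -> node_filesystem_free_bytes (includes root-reserved space)
    avail_bytes -> node_filesystem_avail_bytes (usable by unprivileged users)
    """
    used = size_bytes - free_bytes
    denominator = used + avail_bytes
    if denominator <= 0:
        return None  # empty or pseudo filesystem: usage is undefined
    return 100.0 * used / denominator


print(filesystem_used_percent(1000, 400, 200))
```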
Network (4 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_network_info | Node Network Interface Info | Network interface information | - |
node_network_up | Node Network Up | Network interface status | - |
node_network_speed_bytes | Node Network Speed Bytes | Network speed | Bytes/sec |
node_network_mtu_bytes | Node Network Mtu Bytes | Maximum Transmission Unit | Bytes |
Processes (2 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_procs_running | Node Procs Running | Running process count | - |
node_procs_blocked | Node Procs Blocked | Blocked process count | - |
File Descriptors (3 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_filefd_allocated | Node Filefd Allocated | Allocated file descriptor count | - |
node_filefd_maximum | Node Filefd Maximum | Maximum file descriptor count | - |
node_arp_entries | Node Arp Entries | ARP table entry count | - |
System Boot Time (1 panel)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_boot_time_seconds | Node Boot Time | System boot time | DateTime |
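Because node_boot_time_seconds is a Unix timestamp, uptime is the difference between the current time and that value. A minimal sketch, with a hypothetical helper name:

```python
import time


def uptime_seconds(boot_time_seconds, now=None):
    """Seconds elapsed since boot, given node_boot_time_seconds.

    `now` defaults to the current Unix time; pass it explicitly to
    evaluate uptime at a fixed moment.
    """
    if now is None:
        now = time.time()
    # Clamp at zero in case of clock skew between hosts.
    return max(0.0, now - boot_time_seconds)


print(uptime_seconds(1_700_000_000, now=1_700_003_600))
```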
ZFS Related (16 panels)
ARC Cache (10 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_zfs_arc_size | Node ZFS ARC Size | Current ARC cache size | Bytes |
node_zfs_arc_c | Node ZFS ARC C | ARC target size | Bytes |
node_zfs_arc_c_max | Node ZFS ARC C Max | ARC maximum size | Bytes |
node_zfs_arc_c_min | Node ZFS ARC C Min | ARC minimum size | Bytes |
node_zfs_arc_hits | Node ZFS ARC Hits | ARC cache hit count | - |
node_zfs_arc_misses | Node ZFS ARC Misses | ARC cache miss count | - |
node_zfs_arc_mfu_hits | Node ZFS ARC MFU Hits | Most Frequently Used hit count | - |
node_zfs_arc_mru_hits | Node ZFS ARC MRU Hits | Most Recently Used hit count | - |
node_zfs_arc_demand_data_hits | Node ZFS ARC Demand Data Hits | Demand data hit count | - |
node_zfs_arc_demand_data_misses | Node ZFS ARC Demand Data Misses | Demand data miss count | - |
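The hit and miss counts above are typically read together as a hit ratio. The sketch below assumes both values are cumulative counts and computes the ratio over an interval from two readings. The function name and sample values are hypothetical.

```python
def arc_hit_ratio(hits_start, misses_start, hits_end, misses_end):
    """ARC hit ratio over an interval, from two readings of
    node_zfs_arc_hits and node_zfs_arc_misses.

    Returns a value between 0 and 1, or None if there were no
    ARC accesses (or the counts went backwards, e.g. after a reboot).
    """
    hit_delta = hits_end - hits_start
    miss_delta = misses_end - misses_start
    total = hit_delta + miss_delta
    if hit_delta < 0 or miss_delta < 0 or total == 0:
        return None
    return hit_delta / total


print(arc_hit_ratio(1000, 100, 1900, 200))
```

Using deltas rather than the raw totals keeps the ratio representative of recent activity instead of the whole time since the module was loaded.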
ZFS Pool (6 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
node_zfs_zpool_state | Node ZFS Zpool State | ZFS pool status | - |
node_zfs_zpool_dataset_nread | Node ZFS Zpool Dataset Nread | Dataset bytes read | Bytes |
node_zfs_zpool_dataset_nwritten | Node ZFS Zpool Dataset Nwritten | Dataset bytes written | Bytes |
node_zfs_zpool_dataset_reads | Node ZFS Zpool Dataset Reads | Dataset read operation count | - |
node_zfs_zpool_dataset_writes | Node ZFS Zpool Dataset Writes | Dataset write operation count | - |
node_zfs_zpool_dataset_nunlinks | Node ZFS Dataset Unlinks | Dataset unlink count | - |
Process & Application Metrics
Go Applications (18 panels)
Basic Information (4 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
go_info | Go Version Info | Go language version information | - |
go_goroutines | Go Goroutines | Running goroutine count | - |
go_threads | Go Threads | Thread count | - |
go_sched_gomaxprocs_threads | Go MAX Processors | GOMAXPROCS setting value | - |
Garbage Collection (2 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
go_gc_gogc_percent | Go GC Target | GOGC setting value | % |
go_gc_gomemlimit_bytes | Go Memory Limit | GOMEMLIMIT setting value | Bytes |
Memory Statistics (12 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
go_memstats_alloc_bytes | Go Alloc Memory | Allocated memory | Bytes |
go_memstats_sys_bytes | Go System Memory | System allocated memory | Bytes |
go_memstats_heap_alloc_bytes | Go Heap Alloc | Heap allocated memory | Bytes |
go_memstats_heap_sys_bytes | Go Heap System | Heap system memory | Bytes |
go_memstats_heap_idle_bytes | Go Heap Idle | Heap idle memory | Bytes |
go_memstats_heap_inuse_bytes | Go Heap In Use | Heap in-use memory | Bytes |
go_memstats_heap_released_bytes | Go Heap Released | Heap released memory | Bytes |
go_memstats_heap_objects | Go Heap Objects | Heap object count | - |
go_memstats_stack_inuse_bytes | Go Stack In Use | Stack in-use memory | Bytes |
go_memstats_stack_sys_bytes | Go Stack System | Stack system memory | Bytes |
go_memstats_mspan_inuse_bytes | Go MSpan In Use | MSpan in-use memory | Bytes |
go_memstats_mspan_sys_bytes | Go MSpan System | MSpan system memory | Bytes |
Process Information (6 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
process_resident_memory_bytes | Process Resident Memory | Process physical memory usage | Bytes |
process_virtual_memory_bytes | Process Virtual Memory | Process virtual memory usage | Bytes |
process_virtual_memory_max_bytes | Process Virtual Memory Max | Process maximum virtual memory | Bytes |
process_open_fds | Process Open FDs | Process open file descriptor count | - |
process_max_fds | Process Max FDs | Process maximum file descriptor count | - |
process_start_time_seconds | Process Start Time | Process start time (Unix timestamp) | Seconds |
Scraping Statistics (5 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
scrape_duration_seconds | Scrape Duration | Time taken to scrape the target | Seconds |
scrape_samples_scraped | Scrape Samples Scraped | Collected sample count | - |
scrape_samples_post_metric_relabeling | Scrape Samples Post Relabeling | Post-relabeling sample count | - |
scrape_series_added | Scrape Series Added | Added time series count | - |
up | Target Up Status | Scrape success status | - |
Other System Metrics
Memory Cache (6 panels)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
go_memstats_mcache_inuse_bytes | Go MCache In Use | MCache in-use memory | Bytes |
go_memstats_mcache_sys_bytes | Go MCache System | MCache system memory | Bytes |
go_memstats_gc_sys_bytes | Go GC System Memory | GC system memory | Bytes |
go_memstats_other_sys_bytes | Go Other System Memory | Other system memory | Bytes |
go_memstats_buck_hash_sys_bytes | Go Bucket Hash System Memory | Bucket hash system memory | Bytes |
go_memstats_next_gc_bytes | Go Next GC | Next GC threshold | Bytes |
GC Statistics (1 panel)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
go_memstats_last_gc_time_seconds | Go Last GC Time | Time of the last garbage collection (Unix timestamp) | Seconds |
HTTP Statistics (1 panel)
Metric Name | Panel Name | Description | Unit |
---|---|---|---|
promhttp_metric_handler_requests_in_flight | HTTP Requests In Flight | In-flight HTTP request count | - |