Version: v2509

Metrics Details

This page lists and describes all metrics collected by AIBooster. The panel library provides 132 panels organized into the following categories.

A unit of "-" denotes a dimensionless value.

GPU Metrics (DCGM)

Each DCGM metric has two kinds of panel: one named exactly as listed below, and one with the same name plus a "by Node" suffix. Panels without the suffix list the values collected per node and per GPU. Panels with the suffix show, for each node, the average across all GPUs in that node.
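The "by Node" aggregation described above can be sketched as a per-node average. This is only an illustration of the idea: the `(node, gpu, value)` tuple layout and the function name `average_by_node` are hypothetical, and the actual panels may compute the average differently (for example, in the dashboard query).

```python
from collections import defaultdict
from statistics import mean


def average_by_node(samples):
    """Average per-GPU values within each node.

    `samples` is an iterable of (node, gpu, value) tuples. This layout is a
    hypothetical stand-in for however the per-GPU series are labeled.
    """
    per_node = defaultdict(list)
    for node, _gpu, value in samples:
        per_node[node].append(value)
    # One averaged value per node, as in a "by Node" panel.
    return {node: mean(values) for node, values in per_node.items()}


# Example: GPU utilization (%) for two GPUs on node-a and one on node-b.
samples = [
    ("node-a", 0, 90.0),
    ("node-a", 1, 70.0),
    ("node-b", 0, 40.0),
]
by_node = average_by_node(samples)
print(by_node)
```

A panel without the suffix would correspond to showing the three input samples individually; the "by Node" panel corresponds to the two averaged values.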

Basic GPU Information (11 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU Utilization | GPU utilization percentage | % |
| DCGM_FI_DEV_GPU_TEMP | GPU Temperature | GPU temperature | °C |
| DCGM_FI_DEV_POWER_USAGE | GPU Power Usage | GPU power consumption | Watt |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | GPU Total Energy Consumption | Total energy consumption since boot | Joule |
| DCGM_FI_DEV_FB_USED | GPU Memory Used | Framebuffer memory used | Bytes |
| DCGM_FI_DEV_MEM_CLOCK | GPU Memory Clock | Memory clock frequency | MHz |
| DCGM_FI_DEV_SM_CLOCK | GPU SM Clock | Streaming multiprocessor clock frequency | MHz |
| DCGM_FI_DEV_MEMORY_TEMP | GPU Memory Temperature | GPU memory temperature | °C |
| DCGM_FI_DEV_MEM_COPY_UTIL | GPU Memory Copy Utilization | GPU memory copy utilization | % |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | GPU PCIe Replay Counter | PCIe replay (retry) count | - |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | GPU NVLink Bandwidth Total | Total NVLink bandwidth | Bytes/sec |

Profiling Information (12 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| DCGM_FI_PROF_DRAM_ACTIVE | GPU DRAM Activity | DRAM utilization | % |
| DCGM_FI_PROF_PCIE_RX_BYTES | GPU PCIe RX | PCIe receive bytes | Bytes/sec |
| DCGM_FI_PROF_PCIE_TX_BYTES | GPU PCIe TX | PCIe transmit bytes | Bytes/sec |
| DCGM_FI_PROF_SM_ACTIVE | GPU SM Activity | Streaming multiprocessor utilization | % |
| DCGM_FI_PROF_SM_OCCUPANCY | GPU SM Occupancy | Streaming multiprocessor occupancy | % |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | GPU Tensor Core Activity | Tensor core utilization | % |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | GPU FP16 Pipeline Activity | FP16 pipeline utilization | % |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | GPU FP32 Pipeline Activity | FP32 pipeline utilization | % |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | GPU FP64 Pipeline Activity | FP64 pipeline utilization | % |
| DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE | GPU Tensor IMMA Pipeline Activity | IMMA pipeline utilization | % |
| DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE | GPU Tensor HMMA Pipeline Activity | HMMA pipeline utilization | % |
| DCGM_FI_PROF_PIPE_TENSOR_DFMA_ACTIVE | GPU Tensor DFMA Pipeline Activity | DFMA pipeline utilization | % |

System Metrics (Node Exporter)

CPU & Load (7 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_load1 | Node Load 1min | 1-minute system load average | - |
| node_load5 | Node Load 5min | 5-minute system load average | - |
| node_load15 | Node Load 15min | 15-minute system load average | - |
| node_cpu_frequency_max_hertz | Node CPU Frequency Max Hertz | Maximum CPU frequency | Hz |
| node_cpu_frequency_min_hertz | Node CPU Frequency Min Hertz | Minimum CPU frequency | Hz |
| node_cpu_scaling_frequency_hertz | Node CPU Scaling Frequency Hertz | Current CPU operating frequency | Hz |
| node_cpu_scaling_governor | Node CPU Scaling Governor | CPU governor setting status | - |

Memory (9 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_memory_MemTotal_bytes | Node Memory Total | Total memory capacity | Bytes |
| node_memory_MemAvailable_bytes | Node Memory Available | Available memory capacity | Bytes |
| node_memory_MemFree_bytes | Node Memory Free | Free memory capacity | Bytes |
| node_memory_Active_bytes | Node Memory Active | Active memory usage | Bytes |
| node_memory_Inactive_bytes | Node Memory Inactive | Inactive memory usage | Bytes |
| node_memory_Cached_bytes | Node Memory Cached | Cache memory usage | Bytes |
| node_memory_Buffers_bytes | Node Memory Buffers | Buffer memory usage | Bytes |
| node_memory_SwapTotal_bytes | Node Swap Total | Total swap capacity | Bytes |
| node_memory_SwapFree_bytes | Node Swap Free | Free swap capacity | Bytes |
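The total and available values above are often combined into a single usage ratio. The following is a minimal sketch of one common derivation, assuming usage is defined as total minus available; the function name `memory_used_ratio` is hypothetical, and nothing here is claimed to be how AIBooster panels compute their values.

```python
def memory_used_ratio(mem_total_bytes, mem_available_bytes):
    """Return the fraction of memory in use, in the range 0.0 to 1.0.

    Assumes "used" means total minus available, i.e. values taken from
    node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes.
    """
    if mem_total_bytes <= 0:
        raise ValueError("total memory must be positive")
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes


GIB = 2**30
ratio = memory_used_ratio(16 * GIB, 4 * GIB)
print(f"{ratio:.0%} of memory in use")
```

Using the available value rather than the free value avoids counting reclaimable cache and buffers as used memory.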

Filesystem (5 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_filesystem_size_bytes | Node Filesystem Size Bytes | Total filesystem capacity | Bytes |
| node_filesystem_avail_bytes | Node Filesystem Avail Bytes | Filesystem space available to non-root users | Bytes |
| node_filesystem_free_bytes | Node Filesystem Free Bytes | Free filesystem space | Bytes |
| node_filesystem_files | Node Filesystem Files | Total inode count | - |
| node_filesystem_files_free | Node Filesystem Files Free | Free inode count | - |

Network (4 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_network_info | Node Network Interface Info | Network interface information | - |
| node_network_up | Node Network Up | Network interface status | - |
| node_network_speed_bytes | Node Network Speed Bytes | Network speed | Bytes/sec |
| node_network_mtu_bytes | Node Network Mtu Bytes | Maximum Transmission Unit | Bytes |

Processes (2 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_procs_running | Node Procs Running | Running process count | - |
| node_procs_blocked | Node Procs Blocked | Blocked process count | - |

File Descriptors (3 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_filefd_allocated | Node Filefd Allocated | Allocated file descriptor count | - |
| node_filefd_maximum | Node Filefd Maximum | Maximum file descriptor count | - |
| node_arp_entries | Node Arp Entries | ARP table entry count | - |

System Boot Time (1 panel)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_boot_time_seconds | Node Boot Time | System boot time | DateTime |
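The metric name ends in `_seconds` while the panel unit is DateTime, because node_exporter reports boot time as a Unix timestamp. As a small sketch of that conversion (the helper names `boot_datetime` and `uptime_seconds` are hypothetical, not part of AIBooster):

```python
from datetime import datetime, timezone


def boot_datetime(boot_time_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(boot_time_seconds, tz=timezone.utc)


def uptime_seconds(boot_time_seconds, now_seconds):
    """Uptime is the difference between the current time and the boot time."""
    return now_seconds - boot_time_seconds


boot = 1_700_000_000  # example value of node_boot_time_seconds
print(boot_datetime(boot).isoformat())
print(uptime_seconds(boot, boot + 3600))  # one hour after boot
```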

ARC Cache (10 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_zfs_arc_size | Node ZFS ARC Size | Current ARC cache size | Bytes |
| node_zfs_arc_c | Node ZFS ARC C | ARC target size | Bytes |
| node_zfs_arc_c_max | Node ZFS ARC C Max | ARC maximum size | Bytes |
| node_zfs_arc_c_min | Node ZFS ARC C Min | ARC minimum size | Bytes |
| node_zfs_arc_hits | Node ZFS ARC Hits | ARC cache hit count | - |
| node_zfs_arc_misses | Node ZFS ARC Misses | ARC cache miss count | - |
| node_zfs_arc_mfu_hits | Node ZFS ARC MFU Hits | Most Frequently Used hit count | - |
| node_zfs_arc_mru_hits | Node ZFS ARC MRU Hits | Most Recently Used hit count | - |
| node_zfs_arc_demand_data_hits | Node ZFS ARC Demand Data Hits | Demand data hit count | - |
| node_zfs_arc_demand_data_misses | Node ZFS ARC Demand Data Misses | Demand data miss count | - |
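The hit and miss counts above are cumulative counters, so a hit ratio for a time window is usually derived from the change in each counter over that window. The sketch below shows this under that assumption; `arc_hit_ratio` is a hypothetical helper, not an AIBooster or node_exporter function.

```python
def arc_hit_ratio(hits_start, misses_start, hits_end, misses_end):
    """Return the ARC hit ratio over a window, or None if there were no lookups.

    Arguments are readings of node_zfs_arc_hits and node_zfs_arc_misses taken
    at the start and end of the window.
    """
    delta_hits = hits_end - hits_start
    delta_misses = misses_end - misses_start
    total = delta_hits + delta_misses
    if total <= 0:
        # No cache lookups in the window (or a counter reset): ratio undefined.
        return None
    return delta_hits / total


ratio = arc_hit_ratio(hits_start=1000, misses_start=100,
                      hits_end=1900, misses_end=200)
print(ratio)
```

Returning `None` for an empty window keeps an idle cache from being reported as a 0% hit ratio.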

ZFS Pool (6 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| node_zfs_zpool_state | Node ZFS Zpool State | ZFS pool status | - |
| node_zfs_zpool_dataset_nread | Node ZFS Zpool Dataset Nread | Dataset read bytes | Bytes |
| node_zfs_zpool_dataset_nwritten | Node ZFS Zpool Dataset Nwritten | Dataset write bytes | Bytes |
| node_zfs_zpool_dataset_reads | Node ZFS Zpool Dataset Reads | Dataset read count | - |
| node_zfs_zpool_dataset_writes | Node ZFS Zpool Dataset Writes | Dataset write count | - |
| node_zfs_zpool_dataset_nunlinks | Node ZFS Dataset Unlinks | Dataset unlink count | - |

Process & Application Metrics

Go Applications (18 panels)

Basic Information (4 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| go_info | Go Version Info | Go language version information | - |
| go_goroutines | Go Goroutines | Running goroutine count | - |
| go_threads | Go Threads | Thread count | - |
| go_sched_gomaxprocs_threads | Go MAX Processors | GOMAXPROCS setting value | - |

Garbage Collection (2 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| go_gc_gogc_percent | Go GC Target | GOGC setting value | % |
| go_gc_gomemlimit_bytes | Go Memory Limit | GOMEMLIMIT setting value | Bytes |

Memory Statistics (12 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| go_memstats_alloc_bytes | Go Alloc Memory | Allocated memory | Bytes |
| go_memstats_sys_bytes | Go System Memory | System allocated memory | Bytes |
| go_memstats_heap_alloc_bytes | Go Heap Alloc | Heap allocated memory | Bytes |
| go_memstats_heap_sys_bytes | Go Heap System | Heap system memory | Bytes |
| go_memstats_heap_idle_bytes | Go Heap Idle | Heap idle memory | Bytes |
| go_memstats_heap_inuse_bytes | Go Heap In Use | Heap in-use memory | Bytes |
| go_memstats_heap_released_bytes | Go Heap Released | Heap released memory | Bytes |
| go_memstats_heap_objects | Go Heap Objects | Heap object count | - |
| go_memstats_stack_inuse_bytes | Go Stack In Use | Stack in-use memory | Bytes |
| go_memstats_stack_sys_bytes | Go Stack System | Stack system memory | Bytes |
| go_memstats_mspan_inuse_bytes | Go MSpan In Use | MSpan in-use memory | Bytes |
| go_memstats_mspan_sys_bytes | Go MSpan System | MSpan system memory | Bytes |

Process Information (6 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| process_resident_memory_bytes | Process Resident Memory | Process physical memory usage | Bytes |
| process_virtual_memory_bytes | Process Virtual Memory | Process virtual memory usage | Bytes |
| process_virtual_memory_max_bytes | Process Virtual Memory Max | Process maximum virtual memory | Bytes |
| process_open_fds | Process Open FDs | Process open file descriptor count | - |
| process_max_fds | Process Max FDs | Process maximum file descriptor count | - |
| process_start_time_seconds | Process Start Time | Process start time (Unix timestamp) | Seconds |

Scraping Statistics (5 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| scrape_duration_seconds | Scrape Duration | Metric collection time | Seconds |
| scrape_samples_scraped | Scrape Samples Scraped | Collected sample count | - |
| scrape_samples_post_metric_relabeling | Scrape Samples Post Relabeling | Post-relabeling sample count | - |
| scrape_series_added | Scrape Series Added | Added time series count | - |
| up | Target Up Status | Scrape success status | - |

Other System Metrics

Memory Cache (6 panels)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| go_memstats_mcache_inuse_bytes | Go MCache In Use | MCache in-use memory | Bytes |
| go_memstats_mcache_sys_bytes | Go MCache System | MCache system memory | Bytes |
| go_memstats_gc_sys_bytes | Go GC System Memory | GC system memory | Bytes |
| go_memstats_other_sys_bytes | Go Other System Memory | Other system memory | Bytes |
| go_memstats_buck_hash_sys_bytes | Go Bucket Hash System Memory | Bucket hash system memory | Bytes |
| go_memstats_next_gc_bytes | Go Next GC | Next GC threshold | Bytes |

GC Statistics (1 panel)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| go_memstats_last_gc_time_seconds | Go Last GC Time | Time of the last garbage collection (Unix timestamp) | Seconds |

HTTP Statistics (1 panel)

| Metric Name | Panel Name | Description | Unit |
| --- | --- | --- | --- |
| promhttp_metric_handler_requests_in_flight | HTTP Requests In Flight | In-flight HTTP request count | - |