Cluster GPU Monitoring

1. Introduction

Uk8s uses the open-source component Dcgm-Exporter to collect GPU monitoring metrics, mainly:

GPU card utilization
Container GPU resource utilization

2. Deployment

2.1. If the Monitoring Center is Not Enabled

Enable the Monitoring Center first. Once it is enabled, you can view the NVIDIA/DCGM/Exporter/Node and NVIDIA/DCGM/Exporter/Container dashboards on the Grafana page.

2.2. If the Monitoring Center is Already Enabled

⚠️ If the Monitoring Center version satisfies 1.0.5-3 <= version < 1.0.6, or version > 1.0.6, the components below are installed by default and you can skip the rest of this section. Otherwise, deploy them as follows.

2.2.1. Deployment of Dcgm-Exporter


kubectl apply -f https://docs.surfercloud.com/uk8s/yaml/gpu-share/dcgm-exporter.yaml
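
To confirm that the exporter started, a check along the following lines may help. This is a minimal sketch: the helper name `dcgm_exporter_pods` is hypothetical, and it assumes the Pods created by the manifest have `dcgm-exporter` in their names, so it searches all namespaces instead of assuming one.

```shell
# Hypothetical helper: list Pods in all namespaces whose name contains
# "dcgm-exporter" (assumed naming; adjust the pattern if the manifest differs).
dcgm_exporter_pods() {
  kubectl get pods --all-namespaces -o wide | grep 'dcgm-exporter'
}
```

Calling `dcgm_exporter_pods` should list the matching Pods. If nothing is printed, inspect the manifest for the actual resource names.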

2.2.2. Deployment of NVIDIA/DCGM/Exporter/Node Dashboard

Download the dashboard json file first. Then log in to Grafana and select '+' on the left navigation bar -> Import -> paste the downloaded json content into the second input box -> Load.

2.2.3. Deployment of NVIDIA/DCGM/Exporter/Container Dashboard

⚠️ The official dashboard does not include container-level information. To view GPU metrics per container, import the Dashboard provided by Uk8s.

3. Test

You can quickly start a GPU Pod with the following command. The Pod runs for a short time and then exits. You can then check its GPU usage on the NVIDIA/DCGM/Exporter/Container Dashboard in Grafana.


cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: uhub.surfercloud.com/uk8s/dcgmproftester
    # -t 1004 generates load for the tensor pipes; -d 120 runs for 120 seconds
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      capabilities:
        # SYS_ADMIN lets the tester access GPU profiling counters
        add: ["SYS_ADMIN"]
EOF
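
While the test runs, the Pod can be inspected with ordinary kubectl commands. The sketch below wraps them in small helpers. The helper names are hypothetical, and the Pod name `dcgmproftester` is taken from the manifest above.

```shell
# Pod name from the manifest above (metadata.name).
POD=dcgmproftester

# Hypothetical helper: print the Pod phase (Pending, Running, Succeeded or Failed).
pod_phase() {
  kubectl get pod "$POD" -o jsonpath='{.status.phase}'
}

# Hypothetical helper: show the load generator's output.
pod_logs() {
  kubectl logs "$POD"
}

# Hypothetical helper: delete the Pod once the test has finished.
cleanup_pod() {
  kubectl delete pod "$POD" --ignore-not-found
}
```

A typical sequence is to call `pod_phase` until it reports `Succeeded`, read `pod_logs` if something looks wrong, and then run `cleanup_pod`.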

4. Dashboard Chart

Dashboard	Grafana Charts	Function
NVIDIA/DCGM/Exporter/Node	GPU Temperature	GPU Card Temperature
NVIDIA/DCGM/Exporter/Node	GPU Power Usage	GPU Power Consumption
NVIDIA/DCGM/Exporter/Node	GPU SM Clocks	GPU Clock Frequency
NVIDIA/DCGM/Exporter/Node	GPU Utilization	GPU Utilization
NVIDIA/DCGM/Exporter/Node	Tensor Core Utilization	Fraction of cycles that Tensor Pipes are active
NVIDIA/DCGM/Exporter/Node	GPU Framebuffer Mem Used	GPU Memory Usage
NVIDIA/DCGM/Exporter/Node	GPU XID Error	GPU Card Dropping
NVIDIA/DCGM/Exporter/Container	GPU Utilization	Container GPU Utilization
NVIDIA/DCGM/Exporter/Container	GPU Framebuffer Mem	Container GPU Memory Usage & Remaining
NVIDIA/DCGM/Exporter/Container	GPU Memory Usage	Container GPU Memory Usage Rate

5. Monitoring Rules

A GPU Card Dropping alarm rule is configured by default. To add or modify alarm rules, edit the PrometheusRule with the following command.


kubectl -n uk8s-monitor edit prometheusrule uk8s-gpu
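
As an illustration of what a new rule might look like, the sketch below writes a rule group to a local file so that it can be pasted under `spec.groups` in the editor opened by the command above. The file name, group name, alert name, 85 ℃ threshold, `for` duration, and `severity` label are all illustrative assumptions, not values shipped by Uk8s. Only the metric name `DCGM_FI_DEV_GPU_TEMP` comes from this document.

```shell
# Write an example rule group to a local file (the file name is arbitrary).
# Everything except the metric name is an illustrative assumption.
cat << 'EOF' > gpu-temperature-rule.yaml
- name: gpu-temperature
  rules:
  - alert: GPUHighTemperature
    # Fire when any GPU stays above 85 degrees Celsius for 5 minutes.
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: GPU temperature has been above 85 degrees Celsius for 5 minutes
EOF
```

Whether the alert reaches a notification channel depends on how alert routing is configured in the cluster, so the labels may need to match the existing rules in `uk8s-gpu`.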

6. Common DCGM Metrics

6.1. Utilization

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_GPU_UTIL	Gauge	%	GPU Utilization
DCGM_FI_DEV_MEM_COPY_UTIL	Gauge	%	GPU Memory Bandwidth Utilization
DCGM_FI_DEV_ENC_UTIL	Gauge	%	GPU Encoder Utilization
DCGM_FI_DEV_DEC_UTIL	Gauge	%	GPU Decoder Utilization

6.2. Memory

On a GPU, the card's video memory is also called the frame buffer.

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_FB_FREE	Gauge	MiB	GPU Frame Buffer Remaining
DCGM_FI_DEV_FB_USED	Gauge	MiB	GPU Frame Buffer Used
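
These two metrics are enough to estimate how full the frame buffer is. The helper below is a sketch with a hypothetical name. It treats used + free as the total, which ignores any memory the driver reserves, so the result may differ slightly from what the dashboards show.

```shell
# fb_usage_percent USED_MIB FREE_MIB
# Prints the used share of the frame buffer as a percentage with one decimal.
# Assumption: total is approximated as used + free.
fb_usage_percent() {
  awk -v used="$1" -v free="$2" 'BEGIN {
    total = used + free
    if (total <= 0) { print "0.0"; exit }
    printf "%.1f\n", used * 100 / total
  }'
}

# Example with DCGM_FI_DEV_FB_USED=4096 and DCGM_FI_DEV_FB_FREE=12288:
fb_usage_percent 4096 12288   # prints 25.0
```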

6.3. Frequency

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_SM_CLOCK	Gauge	MHz	GPU SM Clock Frequency
DCGM_FI_DEV_MEM_CLOCK	Gauge	MHz	GPU Memory Clock Frequency

6.4. Profiling

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_PROF_GR_ENGINE_ACTIVE	Gauge	%	The fraction of time the graphics or compute engine is active within a time interval.
DCGM_FI_PROF_SM_ACTIVE	Gauge	%	The fraction of time at least one thread bundle (warp) is active on an SM (Streaming Multiprocessor) within a time interval, averaged over all SMs.
DCGM_FI_PROF_SM_OCCUPANCY	Gauge	%	The ratio of warps resident on an SM to the maximum number of warps the SM can hold within a time interval, averaged over all SMs.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	Gauge	%	The fraction of cycles the Tensor pipes are active per unit time.
DCGM_FI_PROF_DRAM_ACTIVE	Gauge	%	The fraction of cycles with active memory copies (a cycle with a DRAM instruction counts as 100% for that cycle).
DCGM_FI_PROF_PIPE_FP64_ACTIVE	Gauge	%	The fraction of cycles the FP64 pipes are active per unit time.
DCGM_FI_PROF_PIPE_FP32_ACTIVE	Gauge	%	The fraction of cycles the FP32 pipes are active per unit time.
DCGM_FI_PROF_PIPE_FP16_ACTIVE	Gauge	%	The fraction of cycles the FP16 pipes are active per unit time.
DCGM_FI_PROF_NVLINK_RX_BYTES	Counter	B/s	Data flow received via NVLink.
DCGM_FI_PROF_NVLINK_TX_BYTES	Counter	B/s	Data flow transmitted via NVLink.
DCGM_FI_PROF_PCIE_RX_BYTES	Counter	B/s	Number of bytes received via PCIe bus.
DCGM_FI_PROF_PCIE_TX_BYTES	Counter	B/s	Number of bytes transmitted via PCIe bus.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER	Counter	Times	Number of PCIe bus retries (replays) for the GPU.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL	Counter	-	Total of the NVLink bandwidth counters across all NVLink lanes of the GPU.

6.5. Temperature and Power

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_GPU_TEMP	Gauge	℃	Current GPU Temperature
DCGM_FI_DEV_MEMORY_TEMP	Gauge	℃	Current GPU Memory Temperature
DCGM_FI_DEV_POWER_USAGE	Gauge	W	Current GPU Power Consumption
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	Counter	mJ	Total Energy Consumption since GPU Startup
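
Because the energy metric is cumulative and measured in millijoules, the difference between two samples gives the average power over that interval: watts = (delta mJ / 1000) / seconds. A small sketch of that arithmetic follows, with a hypothetical helper name and made-up sample values.

```shell
# avg_power_watts ENERGY_START_MJ ENERGY_END_MJ INTERVAL_SECONDS
# Converts two DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION samples into average watts.
avg_power_watts() {
  awk -v e1="$1" -v e2="$2" -v dt="$3" 'BEGIN {
    if (dt <= 0) { print "interval must be positive" > "/dev/stderr"; exit 1 }
    # millijoules -> joules, then joules per second = watts
    printf "%.1f\n", (e2 - e1) / 1000 / dt
  }'
}

# Made-up samples taken 60 seconds apart: 15,000,000 mJ consumed.
avg_power_watts 1000000 16000000 60   # prints 250.0
```

If the counter restarts between the two samples, the difference becomes negative and the result should be discarded.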

6.6. XID Errors and Violations

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_XID_ERRORS	Gauge	-	Most recent XID error code
DCGM_CUSTOM_XID_ERRORS_TOTAL_COUNTER	Counter	-	Total number of XID errors
DCGM_FI_DEV_POWER_VIOLATION	Counter	μs	The cumulative duration of violations due to power limits
DCGM_FI_DEV_THERMAL_VIOLATION	Counter	μs	The cumulative duration of violations due to thermal limits
DCGM_FI_DEV_SYNC_BOOST_VIOLATION	Counter	μs	The cumulative duration of violations due to synchronous boosting limits
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION	Counter	μs	The cumulative duration of violations due to circuit board limits
DCGM_FI_DEV_LOW_UTIL_VIOLATION	Counter	μs	The cumulative duration of violations due to low utilization limits
DCGM_FI_DEV_RELIABILITY_VIOLATION	Counter	μs	The cumulative duration of violations due to circuit board reliability limits
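
When `DCGM_FI_DEV_XID_ERRORS` is non-zero, its value is an NVIDIA XID code. The sketch below maps a few well-known codes to short descriptions. The helper name is hypothetical, the list is deliberately incomplete, a value of 0 is assumed to mean that no error has been recorded, and NVIDIA's XID documentation remains the authority on what each code means.

```shell
# xid_hint CODE
# Prints a short description for a handful of common XID codes.
xid_hint() {
  case "$1" in
    0)  echo "no XID error recorded" ;;
    13) echo "graphics engine exception" ;;
    31) echo "GPU memory page fault" ;;
    48) echo "double-bit ECC error" ;;
    79) echo "GPU has fallen off the bus" ;;
    *)  echo "unlisted XID $1, see NVIDIA XID documentation" ;;
  esac
}

xid_hint 79   # prints: GPU has fallen off the bus
```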

6.7. Retired Memory Pages

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_RETIRED_SBE	Counter	Pages	Memory pages retired due to single-bit errors
DCGM_FI_DEV_RETIRED_DBE	Counter	Pages	Memory pages retired due to double-bit errors

6.8. Others

Metric Name	Metric Type	Metric Unit	Metric Meaning
DCGM_FI_DEV_VGPU_LICENSE_STATUS	Gauge	-	vGPU License Status
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS	Counter	-	Number of rows remapped due to uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS	Counter	-	Number of rows remapped due to correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE	Gauge	-	Whether row remapping failed