Cluster GPU Monitoring
1. Introduction
UK8s uses the open source component DCGM-Exporter to collect GPU-related monitoring metrics, mainly including:
- GPU card utilization
- Container GPU resource utilization
2. Deployment
2.1. If the Monitoring Center is Not Enabled
After enabling the Monitoring Center, you can view the NVIDIA/DCGM/Exporter/Node and NVIDIA/DCGM/Exporter/Container dashboards on the Grafana page.
2.2. If the Monitoring Center is Already Enabled
⚠️ If the Monitoring Center version satisfies 1.0.5-3 <= version < 1.0.6, or version > 1.0.6, the deployment files below are already installed by default and you can skip the rest of this deployment section; otherwise, carry out the following deployment steps.
2.2.1. Deployment of DCGM-Exporter
```bash
kubectl apply -f https://docs.surfercloud.com/uk8s/yaml/gpu-share/dcgm-exporter.yaml
```
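Before moving on to the dashboards, it can be worth confirming that dcgm-exporter is running and serving metrics. The sketch below makes assumptions: the label selector and the `<namespace>`/`<pod-name>` placeholders depend on what the manifest above actually creates, and 9400 is dcgm-exporter's default listening port.

```bash
# Assumed label selector -- check dcgm-exporter.yaml for the real namespace and labels.
kubectl get pods --all-namespaces -l app=dcgm-exporter -o wide

# Port-forward one exporter Pod and confirm it serves DCGM_* metrics
# (fill in <namespace> and <pod-name> from the command above).
kubectl -n <namespace> port-forward pod/<pod-name> 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```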
2.2.2. Deployment of NVIDIA/DCGM/Exporter/Node Dashboard
After logging into Grafana, first download the JSON file, then select '+' in the left navigation bar --> Import --> paste the downloaded JSON content into the second input box --> Load.
2.2.3. Deployment of NVIDIA/DCGM/Exporter/Container Dashboard
⚠️ The official chart does not contain container-related information. If you need to view container GPU information, import the dashboard provided by UK8s.
After logging into Grafana, first download the JSON file, then select '+' in the left navigation bar --> Import --> paste the downloaded JSON content into the second input box --> Load.
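If you prefer to script the import instead of using the UI, Grafana's HTTP API can load the same JSON. This is only a sketch under assumptions: `dashboard.json` is the file downloaded above, `$GRAFANA_URL` and `$GRAFANA_TOKEN` are a hypothetical Grafana address and API token, and dashboards that declare datasource inputs may need the `/api/dashboards/import` endpoint instead of `/api/dashboards/db`.

```bash
# Wrap the downloaded dashboard JSON in the payload the Grafana API expects,
# then POST it; "overwrite: true" replaces an existing dashboard with the same uid.
jq '{dashboard: ., overwrite: true}' dashboard.json > payload.json

curl -s -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @payload.json
```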
3. Test
You can quickly start a GPU Pod with the following command. The Pod runs for a period of time and then exits; afterwards you can check its GPU usage on the NVIDIA/DCGM/Exporter/Container dashboard in Grafana.
```bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: uhub.surfercloud.com/uk8s/dcgmproftester
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
EOF
```
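The following standard kubectl commands can be used to follow the test run; the Pod name matches the manifest above.

```bash
# Watch the Pod until the load generator finishes (it runs for about 120 seconds, per "-d 120").
kubectl get pod dcgmproftester -w

# Follow the load generator's output while it stresses the GPU.
kubectl logs -f dcgmproftester

# Clean up once you have checked the Grafana dashboard.
kubectl delete pod dcgmproftester
```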
4. Dashboard Chart
Dashboard | Grafana Charts | Function |
---|---|---|
NVIDIA/DCGM/Exporter/Node | GPU Temperature | GPU Card Temperature |
NVIDIA/DCGM/Exporter/Node | GPU Power Usage | GPU Power Consumption |
NVIDIA/DCGM/Exporter/Node | GPU SM Clocks | GPU Clock Frequency |
NVIDIA/DCGM/Exporter/Node | GPU Utilization | GPU Utilization |
NVIDIA/DCGM/Exporter/Node | Tensor Core Utilization | Fraction of cycles that Tensor Pipes are active |
NVIDIA/DCGM/Exporter/Node | GPU Framebuffer Mem Used | GPU Memory Usage |
NVIDIA/DCGM/Exporter/Node | GPU XID Error | GPU XID errors (GPU card dropping) |
NVIDIA/DCGM/Exporter/Container | GPU Utilization | Container GPU Utilization |
NVIDIA/DCGM/Exporter/Container | GPU Framebuffer Mem | Container GPU Memory Usage & Remaining |
NVIDIA/DCGM/Exporter/Container | GPU Memory Usage | Container GPU Memory Usage Rate |
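To illustrate what the container panels query, an expression along the following lines can be tried in Grafana's Explore view. The label names are an assumption: dcgm-exporter attaches pod, namespace, and container labels when its Kubernetes mapping is enabled, but depending on the scrape configuration they may appear as `exported_pod` / `exported_namespace` instead.

```promql
# Average GPU utilization per Pod over the last 5 minutes (label names assumed).
avg by (namespace, pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```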
5. Monitoring Rules
A GPU Card Dropping alert rule is configured by default. If you need to add new alert rules, you can modify them with the following command.
```bash
kubectl -n uk8s-monitor edit prometheusrule uk8s-gpu
```
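For orientation, a new rule added inside that PrometheusRule would look roughly like the sketch below. This is not the rule shipped by UK8s: the group name, alert name, threshold, and labels are hypothetical, and the expression simply reuses the DCGM_FI_DEV_GPU_TEMP metric described in section 6.5.

```yaml
spec:
  groups:
  - name: gpu.custom.rules          # hypothetical group name
    rules:
    - alert: GPUHighTemperature     # hypothetical alert
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU temperature above 85C on {{ $labels.instance }}"
```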
6. Common DCGM Metrics
6.1. Utilization
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU Utilization |
DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | GPU Memory Bandwidth Utilization |
DCGM_FI_DEV_ENC_UTIL | Gauge | % | GPU Encoder Utilization |
DCGM_FI_DEV_DEC_UTIL | Gauge | % | GPU Decoder Utilization |
6.2. Memory
On a GPU, the device memory (video memory) is also referred to as the frame buffer.
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_FB_FREE | Gauge | MiB | GPU Frame Buffer Remaining |
DCGM_FI_DEV_FB_USED | Gauge | MiB | GPU Frame Buffer Used |
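Since the table above only lists the used and free frame buffer sizes, a memory utilization percentage has to be derived from them; a PromQL sketch (assuming both series carry identical GPU/instance labels):

```promql
# Frame buffer utilization in percent.
100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```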
6.3. Frequency
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | GPU SM Clock Frequency |
DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | GPU Memory Clock Frequency |
6.4. Profiling
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | The proportion of time the Graphics or Compute engine is Active within a time interval. |
DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The fraction of time at least one warp (thread bundle) is active on an SM (Streaming Multiprocessor) within a time interval, averaged over all SMs. |
DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | The ratio of warps resident on an SM to the maximum number of warps that can reside on the SM within a time interval, averaged over all SMs. |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The fraction of cycles the Tensor Pipes are Active per unit time. |
DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The fraction of active memory copy cycles (a cycle with a DRAM instruction is considered 100% for that cycle). |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | The fraction of cycles F64 Pipes are Active per unit time. |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | The fraction of cycles F32 Pipes are Active per unit time. |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | The fraction of cycles F16 Pipes are Active per unit time. |
DCGM_FI_PROF_NVLINK_RX_BYTES | Counter | B/s | Data flow received via NVLink. |
DCGM_FI_PROF_NVLINK_TX_BYTES | Counter | B/s | Data flow transmitted via NVLink. |
DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | Number of bytes received via PCIe bus. |
DCGM_FI_PROF_PCIE_TX_BYTES | Counter | B/s | Number of bytes transmitted via PCIe bus. |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Counter | Times | Retry times for GPU PCIe bus. |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Counter | - | The total count of NVLink bandwidth counters for all GPU channels. |
6.5. Temperature and Power
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_GPU_TEMP | Gauge | ℃ | Current GPU Temperature |
DCGM_FI_DEV_MEMORY_TEMP | Gauge | ℃ | Current GPU Memory Temperature |
DCGM_FI_DEV_POWER_USAGE | Gauge | W | Current GPU Power Consumption |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | mJ | Total Energy Consumption since GPU Startup |
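Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is cumulative (millijoules since GPU startup), average power over a window can be derived from it instead of sampling DCGM_FI_DEV_POWER_USAGE: mJ/s equals mW, so dividing by 1000 yields watts. A PromQL sketch:

```promql
# Average power draw over the last 5 minutes, derived from cumulative energy (mJ/s = mW; /1000 -> W).
rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m]) / 1000
```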
6.6. XID Errors and Violations
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_XID_ERRORS | Gauge | - | The most recent XID error code |
DCGM_CUSTOM_XID_ERRORS_TOTAL_COUNTER | Counter | - | Total number of error codes |
DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | The cumulative duration of violations due to power limits |
DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | The cumulative duration of violations due to thermal limits |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | The cumulative duration of violations due to synchronous boosting limits |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | The cumulative duration of violations due to circuit board limits |
DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | The cumulative duration of violations due to low utilization limits |
DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | The cumulative duration of violations due to circuit board reliability limits |
6.7. Disabled Memory Pages
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_RETIRED_SBE | Counter | Pages | Memory pages disabled due to single-bit errors |
DCGM_FI_DEV_RETIRED_DBE | Counter | Pages | Memory pages disabled due to double-bit errors |
6.8. Others
Metric Name | Metric Type | Metric Unit | Metric Meaning |
---|---|---|---|
DCGM_FI_DEV_VGPU_LICENSE_STATUS | Gauge | - | vGPU License Status |
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Counter | - | Number of rows remapped due to uncorrectable errors |
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Counter | - | Number of rows remapped due to correctable errors |
DCGM_FI_DEV_ROW_REMAP_FAILURE | Gauge | - | Whether row remapping failed |