Adjustment of Pod's Tolerance Time for Node Exceptions
1. Principle Explanation
When a node in a Kubernetes cluster enters an abnormal state, there is a waiting period before the Pods on that node are evicted. For critical workloads, can this period be shortened so that Pods are evicted promptly and rebuilt on other healthy nodes when a node encounters an exception?
To answer this question, we first need to understand how Kubernetes evicts Pods when a node is in an abnormal state.
In Kubernetes 1.13 and later versions, the two feature gates TaintBasedEvictions and TaintNodesByCondition are enabled by default. The lifecycle of nodes and their Pods is managed through the node's Conditions and Taints. Kubernetes continuously checks the status of all nodes, sets the corresponding Conditions, applies Taints to nodes based on those Conditions, and then evicts Pods from a node based on its Taints.
In addition, when a Pod is created, a tolerationSeconds parameter is added to the Pod by default. It specifies how long the Pod keeps running on its node after the node becomes abnormal (for example, NotReady).
So, the time from a node being abnormal to a Pod being evicted is determined by two parameters: 1. The time from the actual node exception to being judged unhealthy; 2. The Pod's tolerance time for unhealthy nodes.
In a Kubernetes cluster, by default it takes 40s for an abnormal node to be judged unhealthy, and Pods tolerate NotReady nodes for 5min. As a result, Pods on the node are evicted 5min40s (340s) after the node actually becomes abnormal.
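The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the inputs are the two values discussed in this document (the time until the node is judged unhealthy and the Pod's tolerance time).

```python
def eviction_delay_seconds(grace_period_s: int, toleration_s: int) -> int:
    """Time from the actual node exception until Pods are evicted.

    grace_period_s: time until the node is judged unhealthy
    toleration_s:   the Pod's tolerance time for the unhealthy node
    """
    return grace_period_s + toleration_s


def format_duration(seconds: int) -> str:
    """Render a number of seconds in the XminYs form used in the text."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}min{secs}s"


# Default values: 40s until marked unhealthy, 300s (5min) tolerance.
default_delay = eviction_delay_seconds(40, 300)
print(default_delay, format_duration(default_delay))  # 340 5min40s
```

Plugging in other values shows how much each of the two parameters contributes to the total delay.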
2. Adjust the Time the Node is Marked Unhealthy
The ControllerManager parameter --node-monitor-grace-period controls how long a node may remain unresponsive before it is marked unhealthy. The default value is 40s, and it must be N times larger than Kubelet's nodeStatusUpdateFrequency parameter (the interval at which Kubelet reports node status to the master node), where N is the number of retries Kubelet is allowed when posting node status.
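To sanity-check a candidate grace period against this constraint, a small helper like the one below may be useful. It is a sketch only: the function name is hypothetical, and the 10s value used for nodeStatusUpdateFrequency is an assumption (commonly the Kubelet default) that should be verified against your own Kubelet configuration.

```python
def status_reports_within_grace(grace_period_s: int, update_frequency_s: int) -> int:
    """Return N, the number of complete status-report intervals that fit
    into the grace period.

    Raises ValueError when the grace period is shorter than a single
    reporting interval, since the node would then be marked unhealthy
    before Kubelet had a chance to report again.
    """
    if update_frequency_s <= 0:
        raise ValueError("update frequency must be positive")
    if grace_period_s < update_frequency_s:
        raise ValueError("grace period is shorter than one reporting interval")
    return grace_period_s // update_frequency_s


# ASSUMPTION: nodeStatusUpdateFrequency is 10s on this cluster.
ASSUMED_UPDATE_FREQUENCY_S = 10

print(status_reports_within_grace(40, ASSUMED_UPDATE_FREQUENCY_S))  # 4
print(status_reports_within_grace(20, ASSUMED_UPDATE_FREQUENCY_S))  # 2
```

Under that assumption, lowering the grace period from 40s to 20s leaves Kubelet two reporting intervals instead of four, so a very small value makes false "unhealthy" judgments more likely.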
If you need to modify this parameter, please perform the following operations on each of the three Master nodes:
- Back up the ControllerManager configuration file /etc/kubernetes/controller-manager, then add the parameter --node-monitor-grace-period=20s to it to adjust the tolerance time for marking a node unhealthy to 20s;
- Run systemctl restart kube-controller-manager to restart ControllerManager;
- Run systemctl status kube-controller-manager to confirm that the ControllerManager status is active.
3. Adjust the Pod's Tolerance Time for Unhealthy Nodes
When a Pod is created without explicitly specified tolerations for these taints, Kubernetes automatically adds the following tolerations to the Pod:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
These automatically added tolerations mean that when one of these conditions (NotReady / Unreachable) is detected, the Pod can stay on the current node and keep running for 5 minutes by default before it is evicted.
Note: When Pods in DaemonSet are created, NoExecute tolerations added automatically for unreachable / not-ready taints won't specify tolerationSeconds, ensuring that Pods in DaemonSet will never be evicted when the corresponding issue occurs.
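The difference between an ordinary Pod and a DaemonSet Pod can be illustrated with a simplified model of toleration matching. The sketch below is not the Kubernetes implementation: the function name is hypothetical, it only compares key and effect (it ignores the Equal operator and taint values), and it assumes that a missing tolerationSeconds means "tolerate forever" and that no matching toleration means the Pod is not tolerated at all.

```python
import math


def tolerated_seconds(tolerations, taint_key, effect="NoExecute"):
    """Return how long a Pod tolerates the given taint (simplified model).

    - a number of seconds when a matching toleration sets tolerationSeconds
    - math.inf when a matching toleration has no tolerationSeconds
    - 0 when no toleration matches the taint
    """
    matches = []
    for t in tolerations:
        # An empty key combined with operator Exists matches every taint key.
        key_ok = t.get("key") == taint_key or (
            not t.get("key") and t.get("operator") == "Exists"
        )
        # An empty effect matches every effect.
        effect_ok = t.get("effect") in (None, "", effect)
        if key_ok and effect_ok:
            matches.append(t.get("tolerationSeconds", math.inf))
    return min(matches) if matches else 0


regular_pod = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]
daemonset_pod = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute"},  # no tolerationSeconds: never evicted
]

print(tolerated_seconds(regular_pod, "node.kubernetes.io/not-ready"))    # 300
print(tolerated_seconds(daemonset_pod, "node.kubernetes.io/not-ready"))  # inf
```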
3.1 Adjust Default Tolerance Duration
The tolerance duration of the unreachable / not-ready tolerations that Kubernetes automatically adds to Pods is controlled by parameters of the APIServer. If you need to modify it, please perform the following operations on each of the three Master nodes:
- Back up the APIServer configuration file /etc/kubernetes/apiserver, then add the parameters --default-not-ready-toleration-seconds=100 and --default-unreachable-toleration-seconds=100 to it to adjust the tolerance time (in seconds, 300 by default) for the NotReady:NoExecute and Unreachable:NoExecute taints to 100s;
- Run systemctl restart kube-apiserver to restart APIServer;
- Run systemctl status kube-apiserver to confirm that the APIServer status is active.
3.2 Adjust Existing Pod Tolerance Duration
Taking Pods created through a Deployment as an example, we need to modify the tolerations parameter in the existing Deployment using the kubectl patch command.
First, create the patch file tolerationseconds.yaml, as shown in the example:
spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for Unreachable:NoExecute taint to 100s
        tolerationSeconds: 100
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for NotReady:NoExecute taint to 100s
        tolerationSeconds: 100
Then run the command kubectl patch deploy your-deployment --patch "$(cat tolerationseconds.yaml)" to modify the Deployment. After the modification, you will find that the tolerance duration for the corresponding taints has been updated in the Pods controlled by this Deployment.
⚠️ This operation causes the Deployment to rebuild all of its Pods, so please perform it during off-peak hours.
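If the same patch has to be produced for several Deployments or with different durations, the patch body can also be generated instead of written by hand. The following is a minimal sketch, assuming the patch is emitted as JSON (kubectl patch accepts a JSON patch body as well as YAML); the function name build_toleration_patch is hypothetical.

```python
import json

# Taint keys handled in this document.
TAINT_KEYS = (
    "node.kubernetes.io/unreachable",
    "node.kubernetes.io/not-ready",
)


def build_toleration_patch(seconds: int) -> dict:
    """Build a Deployment patch that sets tolerationSeconds for both taints."""
    if not isinstance(seconds, int) or seconds < 0:
        raise ValueError("seconds must be a non-negative integer")
    tolerations = [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in TAINT_KEYS
    ]
    return {"spec": {"template": {"spec": {"tolerations": tolerations}}}}


# Print the patch body; it can be saved to a file and passed to
# kubectl patch through --patch "$(cat <file>)" as described above.
print(json.dumps(build_toleration_patch(100), indent=2))
```

The generated structure mirrors tolerationseconds.yaml, so applying it has the same effect on the Deployment, including the rebuild of its Pods.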