Adjustment of Pod’s Tolerance Time for Node Exceptions

1. Principle Explanation

After a node in a Kubernetes cluster enters an abnormal state, Kubernetes waits for a period of time before evicting the Pods on that node. For critical workloads, can this waiting time be shortened so that the Pods are evicted promptly and rebuilt on other healthy nodes when a node encounters an exception?

To answer this question, we first need to understand how Kubernetes evicts Pods when a node is in an abnormal state.

In Kubernetes 1.13 and later, the two feature gates TaintBasedEvictions and TaintNodesByCondition are enabled by default, and the lifecycle of nodes and their Pods is managed through the node’s Conditions and Taints. Kubernetes continuously checks the status of every node and sets the corresponding Conditions, adds Taints to the node based on those Conditions, and then evicts Pods from the node based on the Taints.

At the same time, when a Pod is created, tolerations with a tolerationSeconds value are added to it by default. This value specifies how long the Pod keeps running on the node after the node becomes abnormal (for example, NotReady).

So the time from a node becoming abnormal to its Pods being evicted is determined by two durations: 1. the time from the actual node exception to the node being judged unhealthy; 2. the Pod’s tolerance time for unhealthy nodes.

In a Kubernetes cluster, the default time from the actual node exception to the node being judged unhealthy is 40s, and the Pod’s default tolerance time for NotReady nodes is 5min. This means that the Pods on the node are evicted about 5min40s (340s) after the node actually fails.
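The arithmetic above can be checked with a small Python sketch. The function and constant names are illustrative only, and the values are simply the defaults quoted in this section, not values read from a live cluster.

```python
# Illustrative sketch: estimate the delay between a node failing and its
# Pods being evicted, using the default values quoted in the text.

NODE_MONITOR_GRACE_PERIOD_S = 40   # time until the node is judged unhealthy
DEFAULT_TOLERATION_SECONDS = 300   # Pod tolerance time for an unhealthy node


def eviction_delay_seconds(grace_period_s: int, toleration_s: int) -> int:
    """Return the approximate time from node failure to Pod eviction."""
    return grace_period_s + toleration_s


if __name__ == "__main__":
    total = eviction_delay_seconds(NODE_MONITOR_GRACE_PERIOD_S,
                                   DEFAULT_TOLERATION_SECONDS)
    minutes, seconds = divmod(total, 60)
    print(f"{total}s ({minutes}min{seconds}s)")  # prints "340s (5min40s)"
```

Lowering either value shortens the total; for example, a 20s grace period combined with a 100s tolerance time gives roughly 120s.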

2. Adjust the Time the Node is Marked Unhealthy

The ControllerManager parameter --node-monitor-grace-period controls how long a node may be unresponsive before it is marked unhealthy. The default value is 40s. It must be N times Kubelet’s nodeStatusUpdateFrequency parameter (the interval at which Kubelet reports the node status to the master node), where N is the number of retries Kubelet is allowed when posting the node status.
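To get a feel for this constraint, the sketch below counts how many Kubelet status-report intervals fit into a given grace period. It assumes a nodeStatusUpdateFrequency of 10s, which is an assumption to verify against your own Kubelet configuration; the function name is hypothetical.

```python
def status_updates_within(grace_period_s: int,
                          update_frequency_s: int = 10) -> int:
    """Return how many complete status-report intervals fit in the grace period.

    update_frequency_s defaults to 10, an assumed value for Kubelet's
    nodeStatusUpdateFrequency; pass the value used in your cluster.
    """
    if update_frequency_s <= 0:
        raise ValueError("update_frequency_s must be positive")
    return grace_period_s // update_frequency_s


if __name__ == "__main__":
    for grace in (40, 20):
        print(f"grace period {grace}s -> "
              f"{status_updates_within(grace)} report intervals")
```

Under this assumption, a 20s grace period leaves room for only two report intervals, so a short network interruption is more likely to get a healthy node marked unhealthy. It may be worth weighing this before lowering the value further.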

If you need to modify this parameter, please perform the following operations on each of the three Master nodes:

Back up the ControllerManager configuration file /etc/kubernetes/controller-manager, then add the parameter --node-monitor-grace-period=20s to it to reduce the time before a node is marked unhealthy to 20s;
Run systemctl restart kube-controller-manager to restart ControllerManager;
Run systemctl status kube-controller-manager to confirm that the ControllerManager status is active.

3. Adjust the Pod’s Tolerance Time for Unhealthy Nodes

When a Pod is created without explicit tolerations for these taints, the DefaultTolerationSeconds admission controller in the APIServer adds the following tolerations to the Pod:


tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300

These automatically added tolerations mean that when one of these problems (NotReady / Unreachable) is detected, the Pod stays bound to the current node for 5 minutes by default before it is evicted.

Note: For Pods created by a DaemonSet, the NoExecute tolerations that are automatically added for the unreachable / not-ready taints do not specify tolerationSeconds, so these Pods are never evicted when the corresponding problem occurs.
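As an illustration of how tolerationSeconds is interpreted, the sketch below represents the tolerations as plain Python dictionaries and looks up how long a Pod stays on a tainted node. The helper name is hypothetical, and the matching is deliberately simplified to an exact key match with the NoExecute effect; the real matching rules in Kubernetes cover more cases.

```python
from typing import Optional

# Default tolerations added to ordinary Pods, as listed above.
DEFAULT_POD_TOLERATIONS = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]

# DaemonSet Pods get the same tolerations without tolerationSeconds.
DAEMONSET_POD_TOLERATIONS = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute"},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute"},
]


def seconds_before_eviction(tolerations: list, taint_key: str) -> Optional[int]:
    """Return how long a Pod stays on a node carrying a NoExecute taint.

    None means the taint is tolerated indefinitely (no tolerationSeconds);
    0 means there is no matching toleration, so the Pod is evicted at once.
    """
    for toleration in tolerations:
        if (toleration.get("key") == taint_key
                and toleration.get("effect") == "NoExecute"):
            return toleration.get("tolerationSeconds")
    return 0


if __name__ == "__main__":
    key = "node.kubernetes.io/not-ready"
    print(seconds_before_eviction(DEFAULT_POD_TOLERATIONS, key))
    print(seconds_before_eviction(DAEMONSET_POD_TOLERATIONS, key))
```

The first call yields the 300s default, while the DaemonSet case yields None, reflecting that such Pods are not evicted for this taint.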

3.1 Adjust Default Tolerance Duration

The tolerance duration for the unreachable / not-ready taints that Kubernetes automatically adds to Pods is controlled by parameters of the APIServer. Because these tolerations are added when a Pod is created, a change only affects Pods created afterwards. If you need to modify it, please perform the following operations on each of the three Master nodes:

Back up the APIServer configuration file /etc/kubernetes/apiserver, then add the parameters --default-not-ready-toleration-seconds=100 and --default-unreachable-toleration-seconds=100 to it to change the tolerance time for the NotReady:NoExecute and Unreachable:NoExecute taints to 100s (the value is in seconds; the default is 300);
Run systemctl restart kube-apiserver to restart APIServer;
Run systemctl status kube-apiserver to confirm that the APIServer status is active.

3.2 Adjust Existing Pod Tolerance Duration

Taking a Pod created through a Deployment as an example, we modify the tolerations field of the existing Deployment with the kubectl patch command.

First, create the patch file tolerationseconds.yaml, as shown in the example:


spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for Unreachable:NoExecute taint to 100s
        tolerationSeconds: 100
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for NotReady:NoExecute taint to 100s
        tolerationSeconds: 100

Then run the command kubectl patch deploy your-deployment --patch "$(cat tolerationseconds.yaml)" to modify the Deployment. Afterwards, the Pods controlled by this Deployment carry the new tolerance durations for the corresponding taints. Note that the patch replaces the Deployment’s entire tolerations list, so include in the patch file any other tolerations the Deployment already defines.

⚠️ This operation causes the Deployment to recreate all of its Pods, so please perform it during off-peak hours.
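
The patch body from section 3.2 can also be generated programmatically, which may help when the same change has to be applied with different durations. The sketch below builds the patch as JSON for an arbitrary tolerance time. The function name is hypothetical, and the printed command passes the JSON directly to the --patch flag of kubectl patch instead of reading a YAML file.

```python
import json
import shlex

# The two taints handled in this section.
TAINT_KEYS = (
    "node.kubernetes.io/unreachable",
    "node.kubernetes.io/not-ready",
)


def build_toleration_patch(seconds: int) -> dict:
    """Build a Deployment patch that sets tolerationSeconds for both taints."""
    tolerations = [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in TAINT_KEYS
    ]
    return {"spec": {"template": {"spec": {"tolerations": tolerations}}}}


if __name__ == "__main__":
    patch_json = json.dumps(build_toleration_patch(100))
    # "your-deployment" is the placeholder name used in the text above.
    print("kubectl patch deploy your-deployment --patch "
          + shlex.quote(patch_json))
```

The script only prints the command; running the printed command has the same effect as the YAML patch above, including recreating the Pods.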