Taints and tolerations

A node can repel pods with taints unless the pods tolerate all of the taints the node has. Taints are applied to nodes, while tolerations are specific to pods. A taint is a triplet of the form key=value:effect, where the effect can be PreferNoSchedule, NoSchedule, or NoExecute.

Suppose we have a node with some running pods, and none of those pods tolerates a taint k_1=v_1. Each effect then results in the following behavior:

  • NoSchedule: No new pods that don't tolerate k_1=v_1 will be placed on the node
  • PreferNoSchedule: The scheduler tries not to place new pods that don't tolerate k_1=v_1 on the node
  • NoExecute: The running pods are evicted immediately, or after the period specified in the pod's tolerationSeconds has passed (see the sketch after this list)
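
As a minimal sketch of the NoExecute case, using the placeholder taint k_1=v_1 from the list above and an arbitrary 60-second grace period, the toleration under a pod's spec could look like this:

tolerations:
- key: "k_1"
  operator: "Equal"
  value: "v_1"
  effect: "NoExecute"
  tolerationSeconds: 60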

Let's see an example. Here, we have three nodes:

$ kubectl get nodes
NAME                                       STATUS    ROLES     AGE       VERSION
gke-mycluster-default-pool-1e3873a1-jwvd   Ready     <none>    2m        v1.11.2-gke.18
gke-mycluster-default-pool-a1eb51da-fbtj   Ready     <none>    2m        v1.11.2-gke.18
gke-mycluster-default-pool-ec103ce1-t0l7   Ready     <none>    2m        v1.11.2-gke.18

Run a nginx pod:

$ kubectl run --generator=run-pod/v1 --image=nginx:1.15 ngx
pod/ngx created

$ kubectl describe pods ngx
Name:           ngx
Node:           gke-mycluster-default-pool-1e3873a1-jwvd/10.132.0.4
...
Tolerations:    node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s

From the pod description, we can see that it has been placed on the gke-mycluster-default-pool-1e3873a1-jwvd node and that it carries two default tolerations. These mean that if the node becomes not ready or unreachable, the pod waits 300 seconds before being evicted from the node. The two tolerations are added by the DefaultTolerationSeconds admission controller plugin. Now, let's add a taint with the NoExecute effect to the node:

$ kubectl taint nodes gke-mycluster-default-pool-1e3873a1-jwvd \
    experimental=true:NoExecute
node/gke-mycluster-default-pool-1e3873a1-jwvd tainted

Since our pod doesn't tolerate experimental=true and the effect is NoExecute, the pod is evicted from the node immediately and will be recreated elsewhere if it's managed by a controller. Multiple taints can also be applied to a node; a pod must tolerate all of them to run on that node. The following is an example of a pod that can be scheduled on the tainted node:

$ cat chapter8/8-3_management/pod_tolerations.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-tolerations
spec:
  containers:
  - name: web
    image: nginx
  tolerations:
  - key: "experimental"
    value: "true"
    operator: "Equal"
    effect: "NoExecute"

$ kubectl apply -f chapter8/8-3_management/pod_tolerations.yml
pod/pod-with-tolerations created

$ kubectl get pod -o wide
NAME                   READY     STATUS    RESTARTS   AGE       IP          NODE
pod-with-tolerations   1/1       Running   0          7s        10.32.1.4   gke-mycluster-default-pool-1e3873a1-jwvd

As we can see, the new pod can now run on the tainted node, gke-mycluster-default-pool-1e3873a1-jwvd.
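
To remove the taint later, we can append a dash to the key:effect pair in the same kubectl taint command:

$ kubectl taint nodes gke-mycluster-default-pool-1e3873a1-jwvd experimental:NoExecute-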

In addition to the Equal operator, we can also use Exists. In that case, we don't need to specify the value field: as long as the node is tainted with the specified key and the effect matches, the pod is eligible to run on that tainted node, as shown in the following sketch.
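
For instance, a minimal sketch of the earlier pod's toleration rewritten with Exists (same key and effect as before, with no value field) would be:

tolerations:
- key: "experimental"
  operator: "Exists"
  effect: "NoExecute"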

Depending on a node's condition, certain taints may be populated by the node controller, the kubelet, the cloud provider, or cluster administrators to move pods away from the node. These taints are as follows:

  • node.kubernetes.io/not-ready
  • node.kubernetes.io/unreachable
  • node.kubernetes.io/out-of-disk
  • node.kubernetes.io/memory-pressure
  • node.kubernetes.io/disk-pressure
  • node.kubernetes.io/network-unavailable
  • node.cloudprovider.kubernetes.io/uninitialized
  • node.kubernetes.io/unschedulable

If there's a critical pod that needs to run even under those circumstances, it should explicitly tolerate the corresponding taints (see the sketch after this list). For example, pods managed by a DaemonSet tolerate the following taints with the NoSchedule effect:

  • node.kubernetes.io/memory-pressure
  • node.kubernetes.io/disk-pressure
  • node.kubernetes.io/out-of-disk
  • node.kubernetes.io/unschedulable
  • node.kubernetes.io/network-unavailable
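
As a sketch of how a critical pod could explicitly opt in to one of these taints, the following toleration could be added to its spec (it uses Exists, so no value is needed; the taint key is one from the list above, chosen just for illustration):

tolerations:
- key: "node.kubernetes.io/network-unavailable"
  operator: "Exists"
  effect: "NoSchedule"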

For node administration, we can use kubectl cordon <node_name> to mark the node as unschedulable (which results in the node.kubernetes.io/unschedulable:NoSchedule taint), and kubectl uncordon <node_name> to revert the action. Another command, kubectl drain, evicts the pods on the node and also marks the node as unschedulable.
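
For example, a typical maintenance flow on the node from our earlier example might look like the following (--ignore-daemonsets is usually required because drain does not evict DaemonSet-managed pods):

$ kubectl drain gke-mycluster-default-pool-1e3873a1-jwvd --ignore-daemonsets
# ... perform maintenance on the node ...
$ kubectl uncordon gke-mycluster-default-pool-1e3873a1-jwvd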
