Taints and tolerations

A node can repel pods with taints unless the pods tolerate all of the taints the node has. Taints are applied to nodes, while tolerations are specific to pods. A taint is a triplet of the form key=value:effect, where the effect can be PreferNoSchedule, NoSchedule, or NoExecute.

Suppose we have a node with some running pods, and none of those pods tolerates a taint k_1=v_1. Each effect then results in the following behavior:

  • NoSchedule: No new pods that don't tolerate k_1=v_1 will be placed on the node
  • PreferNoSchedule: The scheduler tries not to place new pods that don't tolerate k_1=v_1 on the node
  • NoExecute: The running pods are evicted immediately, or after the period specified in the pod's tolerationSeconds has passed (see the sketch after this list)
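
As a minimal sketch of the NoExecute case, using the placeholder taint k_1=v_1 from the list above and an arbitrary 60-second grace period, the toleration under a pod's spec could look like this:

tolerations:
- key: "k_1"
  operator: "Equal"
  value: "v_1"
  effect: "NoExecute"
  tolerationSeconds: 60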

Let's see an example. Here, we have three nodes:

$ kubectl get nodes
NAME                                       STATUS    ROLES     AGE       VERSION
gke-mycluster-default-pool-1e3873a1-jwvd   Ready     <none>    2m        v1.11.2-gke.18
gke-mycluster-default-pool-a1eb51da-fbtj   Ready     <none>    2m        v1.11.2-gke.18
gke-mycluster-default-pool-ec103ce1-t0l7   Ready     <none>    2m        v1.11.2-gke.18

Run a nginx pod:

$ kubectl run --generator=run-pod/v1 --image=nginx:1.15 ngx
pod/ngx created

$ kubectl describe pods ngx
Name:           ngx
Node:           gke-mycluster-default-pool-1e3873a1-jwvd/10.132.0.4
...
Tolerations:    node.kubernetes.io/not-ready:NoExecute for 300s
                node.kubernetes.io/unreachable:NoExecute for 300s

From the pod description, we can see that it has been placed on the gke-mycluster-default-pool-1e3873a1-jwvd node and that it carries two default tolerations. These mean that if the node becomes not ready or unreachable, the pod waits 300 seconds before being evicted from the node. The two tolerations are added by the DefaultTolerationSeconds admission controller plugin. Now, let's add a taint with the NoExecute effect to the node:

$ kubectl taint nodes gke-mycluster-default-pool-1e3873a1-jwvd \
    experimental=true:NoExecute
node/gke-mycluster-default-pool-1e3873a1-jwvd tainted

Since our pod doesn't tolerate experimental=true and the effect is NoExecute, the pod is evicted from the node immediately and will be recreated elsewhere if it's managed by a controller. Multiple taints can also be applied to a node; a pod must tolerate all of them to run on that node. The following is an example of a pod that can be scheduled on the tainted node:

$ cat chapter8/8-3_management/pod_tolerations.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-tolerations
spec:
  containers:
  - name: web
    image: nginx
  tolerations:
  - key: "experimental"
    value: "true"
    operator: "Equal"
    effect: "NoExecute"

$ kubectl apply -f chapter8/8-3_management/pod_tolerations.yml
pod/pod-with-tolerations created

$ kubectl get pod -o wide
NAME                   READY     STATUS    RESTARTS   AGE       IP          NODE
pod-with-tolerations   1/1       Running   0          7s        10.32.1.4   gke-mycluster-default-pool-1e3873a1-jwvd

As we can see, the new pod can now run on the tainted node, gke-mycluster-default-pool-1e3873a1-jwvd.
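
To remove the taint later, we can append a dash to the key:effect pair in the same kubectl taint command:

$ kubectl taint nodes gke-mycluster-default-pool-1e3873a1-jwvd experimental:NoExecute-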

In addition to the Equal operator, we can also use Exists. In that case, we don't need to specify the value field: as long as the node is tainted with the specified key and the effect matches, the pod is eligible to run on that tainted node, as shown in the following sketch.
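
For instance, a minimal sketch of the earlier pod's toleration rewritten with Exists (same key and effect as before, with no value field) would be:

tolerations:
- key: "experimental"
  operator: "Exists"
  effect: "NoExecute"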

Depending on a node's condition, certain taints may be populated by the node controller, the kubelet, the cloud provider, or cluster administrators to move pods away from the node. These taints are as follows:

  • node.kubernetes.io/not-ready
  • node.kubernetes.io/unreachable
  • node.kubernetes.io/out-of-disk
  • node.kubernetes.io/memory-pressure
  • node.kubernetes.io/disk-pressure
  • node.kubernetes.io/network-unavailable
  • node.cloudprovider.kubernetes.io/uninitialized
  • node.kubernetes.io/unschedulable

If there's a critical pod that needs to run even under those circumstances, it should explicitly tolerate the corresponding taints (see the sketch after this list). For example, pods managed by a DaemonSet tolerate the following taints with the NoSchedule effect:

  • node.kubernetes.io/memory-pressure
  • node.kubernetes.io/disk-pressure
  • node.kubernetes.io/out-of-disk
  • node.kubernetes.io/unschedulable
  • node.kubernetes.io/network-unavailable
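
As a sketch of how a critical pod could explicitly opt in to one of these taints, the following toleration could be added to its spec (it uses Exists, so no value is needed; the taint key is one from the list above, chosen just for illustration):

tolerations:
- key: "node.kubernetes.io/network-unavailable"
  operator: "Exists"
  effect: "NoSchedule"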

For node administration, we can use kubectl cordon <node_name> to mark the node as unschedulable (which results in the node.kubernetes.io/unschedulable:NoSchedule taint), and kubectl uncordon <node_name> to revert the action. Another command, kubectl drain, evicts the pods on the node and also marks the node as unschedulable.
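
For example, a typical maintenance flow on the node from our earlier example might look like the following (--ignore-daemonsets is usually required because drain does not evict DaemonSet-managed pods):

$ kubectl drain gke-mycluster-default-pool-1e3873a1-jwvd --ignore-daemonsets
# ... perform maintenance on the node ...
$ kubectl uncordon gke-mycluster-default-pool-1e3873a1-jwvd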
