At the beginning of the book, in Chapter 2, Kubernetes Architecture – From Docker Images to Running Pods, we explained the principles behind the Kubernetes scheduler (kube-scheduler) control plane component and its crucial role in the cluster. In short, its responsibility is to schedule container workloads (Kubernetes Pods) and assign them to healthy worker Nodes that fulfill the criteria required for running a particular workload.
This chapter will cover how you can control the criteria for scheduling Pods in the cluster. We will especially dive deeper into Node affinity, taints, and tolerations for Pods. We will also take a closer look at scheduling policies, which give kube-scheduler flexibility in how it prioritizes Pod workloads. You will find all of these concepts important in running production clusters at cloud scale.
In this chapter, we will cover the following topics:
- Refreshing how the Kubernetes scheduler works
- Managing Node affinity
- Using Node taints and tolerations
- Understanding the scheduling policies of kube-scheduler
For this chapter, you will need the following:
- A deployed Kubernetes cluster (either local or cloud-based)
- The Kubernetes command-line tool, kubectl, installed on your local machine and configured to manage your cluster
Basic Kubernetes cluster deployment (local and cloud-based) and kubectl installation have been covered in Chapter 3, Installing Your First Kubernetes Cluster.
The following previous chapters can give you an overview of how to deploy a fully functional Kubernetes cluster on different cloud platforms:
You can download the latest code samples for this chapter from the official GitHub repository: https://github.com/PacktPublishing/The-Kubernetes-Bible/tree/master/Chapter19.
In Kubernetes clusters, kube-scheduler is a component of the control plane that runs on Master Nodes. The main responsibility of this component is scheduling container workloads (Pods) and assigning them to healthy worker Nodes that fulfill the criteria required for running a particular workload. To recap, a Pod is a group of one or more containers with a shared network and storage and is the smallest deployment unit in the Kubernetes system. You usually use different Kubernetes controllers, such as Deployment objects and StatefulSet objects, to manage your Pods, but it is kube-scheduler that eventually assigns the created Pods to particular Nodes in the cluster.
Important note
For managed Kubernetes clusters in the cloud, such as the managed Azure Kubernetes Service or the Amazon Elastic Kubernetes Service, you normally do not have access to the Master Nodes, as they are managed by the cloud service provider for you. This means you will not have access to kube-scheduler itself, and usually, you cannot control its configuration, such as scheduling policies. But you can control all parameters for Pods that influence their scheduling.
Kube-scheduler queries the Kubernetes API Server (kube-apiserver) at a regular interval in order to list the Pods that have not been scheduled. At creation, Pods are marked as not scheduled – this means no worker Node was elected to run them. A Pod that is not scheduled will be registered in the etcd cluster state but without any worker Node assigned to it, and thus, no running kubelet will be aware of this Pod. Ultimately, no container described in the Pod specification will run at this point.
Internally, the Pod object, as it is stored in etcd, has a property called nodeName. As the name suggests, this property should contain the name of the worker Node that will host the Pod. When this property is set, we say the Pod is in a scheduled state; otherwise, the Pod is in a pending state.
We need a way to fill in this value, and that is the role of kube-scheduler. For this, kube-scheduler continuously polls the kube-apiserver at a regular interval, looking for Pod resources with an empty nodeName property. Once it finds such Pods, it executes an algorithm to elect a worker Node and updates the nodeName property in the Pod object by issuing an HTTP request to the kube-apiserver. When electing a worker Node, kube-scheduler takes into account its internal scheduling policies and the criteria that you defined for the Pods. Finally, the kubelet that is responsible for running Pods on the elected worker Node will notice that there is a new Pod in the scheduled state for its Node and will attempt to start the Pod. These principles are visualized in the following diagram:
The scheduling process for a Pod is performed in two phases:
- Filtering: kube-scheduler determines the set of Nodes that are capable of running the given Pod, for example, by checking whether the Pod's resource requests and tolerations can be satisfied.
- Scoring: kube-scheduler ranks the filtered Nodes by assigning a score to each of them and elects the Node with the highest score to run the Pod.
The kube-scheduler will consider criteria and configuration values you can optionally pass in the Pod specification. By using these configurations, you can control precisely how the kube-scheduler will elect a worker Node.
Important note
The decisions of kube-scheduler are valid precisely at the point in time of scheduling the Pod. Once the Pod is scheduled and running, kube-scheduler will not perform any rescheduling operations for it (and the Pod may run for days or even months). So even if the Pod no longer matches the Node according to your rules, it will keep running there. Rescheduling happens only when the Pod is terminated and a new Pod needs to be scheduled.
In the next sections, we will discuss the following configurations to control the scheduling of Pods:
Let's first take a look at Node affinity, together with Node name and Node selector.
To better understand how Node affinity works in Kubernetes, we first need to take a look at the most basic scheduling options, which are using Node name and Node selector for Pods.
As we mentioned before, each Pod object has a nodeName field which is usually controlled by the kube-scheduler. Nevertheless, it is possible to set this property directly in the YAML manifest when you create a Pod or create a controller that uses a Pod template. This is the simplest form of statically scheduling Pods on a given Node and is generally not recommended – it is not flexible and does not scale at all. The names of Nodes can change over time and you risk running out of resources on the Node.
Tip
You may find setting nodeName explicitly useful in debugging scenarios when you want to run a Pod on a specific Node.
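For such a debugging scenario, a bare Pod manifest pinned to a specific Node could look like the following sketch (the Node name here is just an example from our cluster and is an assumption – substitute one of your own Nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-debug
spec:
  # Setting nodeName directly bypasses kube-scheduler entirely
  nodeName: aks-nodepool1-77120516-vmss000000
  containers:
  - name: nginx
    image: nginx:1.17
```

Because the scheduler is bypassed, no filtering or scoring takes place – if the Node lacks resources, the Pod will simply fail to start.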
We are going to demonstrate all scheduling principles on an example Deployment object that we introduced in Chapter 11, Deployment – Deploying Stateless Applications. This is a simple Deployment that manages five Pod replicas of an nginx webserver running in a container. Create the following YAML manifest named nginx-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
      environment: test
  template:
    metadata:
      labels:
        app: nginx
        environment: test
    spec:
      containers:
      - name: nginx
        image: nginx:1.17
        ports:
        - containerPort: 80
At this point, the Pod template in .spec.template.spec does not contain any configuration that affects the scheduling of the Pod replicas. Before we apply the manifest to the cluster, we need to know what Nodes we have in the cluster so that we can understand how the Pods are scheduled onto them and how we can influence that scheduling. You can get the list of Nodes using the kubectl get nodes command:
$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1-77120516-vmss000000   Ready    agent   1d    v1.18.14
aks-nodepool1-77120516-vmss000001   Ready    agent   1d    v1.18.14
aks-nodepool1-77120516-vmss000002   Ready    agent   1d    v1.18.14
In our example, we are running a three-Node cluster. For simplicity, we will refer to aks-nodepool1-77120516-vmss000000 as Node 0, aks-nodepool1-77120516-vmss000001 as Node 1, and aks-nodepool1-77120516-vmss000002 as Node 2.
Now, let's apply the nginx-deployment.yaml YAML manifest to the cluster:
$ kubectl apply -f ./nginx-deployment.yaml
deployment.apps/nginx-deployment-example created
The Deployment object will create five Pod replicas. You can get their statuses, together with the Node names that they were scheduled for, using the following command:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-5549875c78-nndb4   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-5549875c78-ps7pd   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-5549875c78-s824f   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-5549875c78-xfbkj   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-5549875c78-zg2w7   Running   aks-nodepool1-77120516-vmss000000
As you can see, by default the Pods have been distributed uniformly – Node 0 has received two Pods, Node 1 one Pod, and Node 2 two Pods. This is a result of the default scheduling policies enabled in the kube-scheduler for filtering and scoring.
Tip
If you are running a non-managed Kubernetes cluster, you can inspect the logs for the kube-scheduler Pod using the kubectl logs command, or even directly on the master Nodes in /var/log/kube-scheduler.log. This may also require increasing the log verbosity for the kube-scheduler process. You can read more at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/.
We will now forcefully assign all Pods in the Deployment to Node 0 in the cluster using the nodeName field in the Pod template. Change the nginx-deployment.yaml YAML manifest so that it has this property set with the correct Node name for your cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      nodeName: aks-nodepool1-77120516-vmss000000
      ...
Apply the manifest to the cluster using the kubectl apply -f ./nginx-deployment.yaml command and inspect the Pod status and Node assignment again:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6977595df5-95sfh   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-cxgqb   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-h5wwk   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-pww9g   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-q5xxs   Running   aks-nodepool1-77120516-vmss000000
As expected, all five Pods are now running on Node 0. These are all new Pods – when you change the Pod template in the Deployment specification, it internally causes a rollout using a new ReplicaSet object, while the old ReplicaSet object is scaled down, as explained in Chapter 11, Deployment – Deploying Stateless Applications.
Important note
In this way, we have actually bypassed kube-scheduler. If you inspect events for one of the Pods using the kubectl describe pod command, you will see that it lacks any events with Scheduled as a reason.
Next, we are going to take a look at another basic method of scheduling Pods, which is the Node selector.
The Pod specification has a special field, .spec.nodeSelector, that gives you the ability to schedule your Pod only on Nodes that have certain label values. This concept is similar to the label selectors that you know from Deployments or StatefulSets, but the difference is that it allows only simple equality-based comparisons for labels – you cannot do advanced set-based logic.
A very common use case for scheduling Pods using nodeSelector is managing Pods in hybrid Windows/Linux clusters. Every Kubernetes Node comes by default with a set of labels, which include the following:
- kubernetes.io/arch: the processor architecture of the Node, for example, amd64 or arm64
- kubernetes.io/os: the operating system running on the Node, with a value of linux or windows
- kubernetes.io/hostname: the hostname of the Node
If you inspect the labels for one of the Nodes, you will see that there are plenty of them – in our case some of them are specific to Azure Kubernetes Service (AKS) clusters only:
$ kubectl describe node aks-nodepool1-77120516-vmss000000
...
Labels:     agentpool=nodepool1
            beta.kubernetes.io/arch=amd64
            beta.kubernetes.io/instance-type=Standard_DS2_v2
            beta.kubernetes.io/os=linux
            failure-domain.beta.kubernetes.io/region=eastus
            failure-domain.beta.kubernetes.io/zone=0
            kubernetes.azure.com/cluster=MC_k8sforbeginners-rg_k8sforbeginners-aks_eastus
            kubernetes.azure.com/mode=system
            kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2-2021.02.17
            kubernetes.azure.com/role=agent
            kubernetes.io/arch=amd64
            kubernetes.io/hostname=aks-nodepool1-77120516-vmss000000
            kubernetes.io/os=linux
            kubernetes.io/role=agent
            node-role.kubernetes.io/agent=
            node.kubernetes.io/instance-type=Standard_DS2_v2
            storageprofile=managed
            storagetier=Premium_LRS
            topology.kubernetes.io/region=eastus
            topology.kubernetes.io/zone=0
...
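Putting the default OS label to use, a hedged sketch of the hybrid Windows/Linux use case mentioned earlier could look like the following – a Pod template fragment that restricts scheduling to Linux Nodes:

```yaml
# Sketch: restricting a Pod to Linux Nodes using the default OS label.
apiVersion: v1
kind: Pod
metadata:
  name: linux-only-pod
spec:
  nodeSelector:
    kubernetes.io/os: linux   # use "windows" for Windows container workloads
  containers:
  - name: nginx
    image: nginx:1.17
```

This is what prevents, for example, a Linux container image from being scheduled onto a Windows worker Node in a mixed cluster.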
Of course, you can define your own labels for the Nodes and use them to control scheduling. Please note that in general you should use semantic labeling for your resources in Kubernetes, rather than give them special labels just for the purpose of scheduling. Let's demonstrate how to do that by following these steps:
First, label Node 1 and Node 2 with node-type=superfast:
$ kubectl label nodes aks-nodepool1-77120516-vmss000001 node-type=superfast
node/aks-nodepool1-77120516-vmss000001 labeled
$ kubectl label nodes aks-nodepool1-77120516-vmss000002 node-type=superfast
node/aks-nodepool1-77120516-vmss000002 labeled
Next, edit the nginx-deployment.yaml manifest so that the Pod template requires this label using nodeSelector:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      nodeSelector:
        node-type: superfast
      ...
Apply the manifest using the kubectl apply -f ./nginx-deployment.yaml command and inspect the Pod assignments:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-8485bc9569-2pm5h   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-79gn9   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-8485bc9569-df6x8   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-fd4gv   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-8485bc9569-tlxgl   Running   aks-nodepool1-77120516-vmss000002
As you can see, Node 1 has been assigned with two Pods and Node 2 with three Pods. The Pods have been distributed among Nodes that have the node-type=superfast label.
Now, let's see what happens when the nodeSelector does not match any Node. Change the manifest to require a node-type=slow label, which currently no Node in our cluster has:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      nodeSelector:
        node-type: slow
      ...
Apply the manifest again and check the Pods:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-54dbf4699f-jdx42   Pending   <none>
nginx-deployment-example-54dbf4699f-sk2jd   Pending   <none>
nginx-deployment-example-54dbf4699f-xjdp2   Pending   <none>
nginx-deployment-example-8485bc9569-2pm5h   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-df6x8   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-fd4gv   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-8485bc9569-tlxgl   Running   aks-nodepool1-77120516-vmss000002
The reason why three new Pods are Pending and four old Pods are still Running is the default configuration of rolling updates in the Deployment object. By default, maxSurge is set to 25% of the Pod replicas (with the absolute number rounded up), so in our case two Pods are allowed to be created above the desired count of five – in total, we can have up to seven Pods. At the same time, maxUnavailable is also 25% of the Pod replicas (but with the absolute number rounded down), so in our case one Pod out of five can be unavailable – in other words, four Pods must be Running. And because the new Pending Pods cannot get a Node assigned during scheduling, the Deployment is stuck waiting and not progressing. Normally, in this case, you need to either perform a rollback to the previous version of the Deployment or change nodeSelector to one that properly matches existing Nodes. Alternatively, you can add a new Node with matching labels, or add the missing labels to the existing Nodes, without performing a rollback.
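The defaults described above can be made explicit (or tuned) in the Deployment's rollout strategy. A sketch of the relevant manifest fragment, with the default values spelled out:

```yaml
# Sketch: the rolling update parameters discussed above, set explicitly.
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # 25% of 5, rounded up = 2 extra Pods allowed
      maxUnavailable: 25%    # 25% of 5, rounded down = 1 Pod may be unavailable
```

Setting maxSurge: 0 together with a higher maxUnavailable would instead replace Pods in place without ever exceeding the desired replica count.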
We will now continue the topic of scheduling Pods with the first of more advanced techniques: Node affinity.
The concept of Node affinity expands the nodeSelector approach and provides a richer language for defining which Nodes are preferred or avoided for your Pod. In everyday life, the word affinity describes a natural liking for and understanding of someone or something, and this best describes the purpose of Node affinity for Pods. That is, you can control which Nodes your Pod will be attracted to or repelled by.
With Node affinity, represented in .spec.affinity.nodeAffinity for the Pod, you get the following enhancements over the simple nodeSelector:
- A richer language for defining rules: besides equality checks, you can use operators such as In, NotIn, Exists, DoesNotExist, Gt, and Lt in matchExpressions.
- Hard rules, defined in requiredDuringSchedulingIgnoredDuringExecution, which must be fulfilled for a Pod to be scheduled on a Node.
- Soft (preference) rules, defined in preferredDuringSchedulingIgnoredDuringExecution, which kube-scheduler will try to fulfill but which are not a strict requirement; each rule has a weight that contributes to the Node's score.
Tip
Even though Node anti-affinity is not provided as a separate field in the spec (as it is in the case of inter-Pod anti-affinity), you can still achieve similar results by using the NotIn and DoesNotExist operators. In this way, you can make Pods be repelled from Nodes with specific labels, also in a soft way.
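For example, a soft Node anti-affinity that repels Pods from Nodes labeled node-type=slow (a hypothetical label used only for illustration) might be sketched like this:

```yaml
# Sketch: soft anti-affinity achieved with the NotIn operator.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-type
          operator: NotIn      # prefer Nodes WITHOUT this label value
          values:
          - slow
```

Because this is a preferred rule, the Pod can still land on a node-type=slow Node if no better Node is available.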
The use cases and scenarios for defining Node affinity and inter-Pod affinity/anti-affinity rules are practically unlimited. It is possible to express all kinds of requirements in this way, provided that you have enough labeling on the Nodes. For example, you can model requirements like scheduling the Pod only on a Windows Node with an Intel CPU and premium storage in the West Europe region that is currently not running Pods for MySQL, or trying not to schedule the Pod in availability Zone 1 while still allowing it if there is no other possibility.
To demonstrate Node affinity, we will try to model the following requirements for our Deployment: "Try to schedule the Pod only on Nodes that have a node-type label with a fast or superfast value, but if this is not possible, use any Node except ones that have a node-type label with an extremelyslow value." For this, we need to use the following:
- A soft affinity rule in preferredDuringSchedulingIgnoredDuringExecution with the In operator for the fast and superfast values
- A hard anti-affinity rule in requiredDuringSchedulingIgnoredDuringExecution with the NotIn operator for the extremelyslow value
In our cluster, we are going to first have the following labeling for the Nodes:
- Node 0: node-type=slow
- Node 1: node-type=fast
- Node 2: node-type=superfast
As you can see, according to our requirements the Deployment Pods should be scheduled on Node 1 and Node 2, unless there is something preventing them from being allocated there, like a lack of CPU or memory resources. In that case, Node 0 would also be allowed as we use the soft affinity rule.
Next, we will relabel the Nodes in the following way:
- Node 0: node-type=slow (unchanged)
- Node 1: node-type=extremelyslow
- Node 2: node-type=extremelyslow
Subsequently, we will need to redeploy our Deployment (for example, scale it down to zero and up to the original replica count, or use the kubectl rollout restart command) to reschedule the Pods again. After that, looking at our requirements, kube-scheduler should assign all Pods to Node 0 (because it is still allowed by the soft rule) but avoid at all costs Node 1 and Node 2. If by any chance Node 0 has no resources to run the Pod, then the Pods would be stuck in the Pending state.
Tip
To solve the issue of rescheduling already running Pods (in other words, to make kube-scheduler consider them again), there is an incubating Kubernetes project named Descheduler. You can find out more here: https://github.com/kubernetes-sigs/descheduler.
To do the demonstration, please follow these steps:
First, relabel the Nodes so that Node 0 is slow, Node 1 is fast, and Node 2 is superfast (--overwrite is required for labels that already exist):
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000000 node-type=slow
node/aks-nodepool1-77120516-vmss000000 labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000001 node-type=fast
node/aks-nodepool1-77120516-vmss000001 labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000002 node-type=superfast
node/aks-nodepool1-77120516-vmss000002 not labeled # Note that this label was already present with this value
Next, modify the nginx-deployment.yaml manifest so that the Pod template defines the Node affinity rules (and no longer uses nodeSelector):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: NotIn
                values:
                - extremelyslow
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - fast
                - superfast
      ...
As you can see, we have used nodeAffinity (not podAffinity or podAntiAffinity) with preferredDuringSchedulingIgnoredDuringExecution set so that it has only one soft rule: node-type should have a fast value or a superfast value. This means that if there are no resources on such Nodes, they can still be scheduled on other Nodes. Additionally, we specify one hard anti-affinity rule in requiredDuringSchedulingIgnoredDuringExecution, which says that node-type must not be extremelyslow. You can find the full specification of Pod's .spec.affinity in the official documentation: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#affinity-v1-core.
Apply the manifest to the cluster and inspect the Pod assignments:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-7ff6c65bd4-8z7z5   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-7ff6c65bd4-ps9md   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-7ff6c65bd4-pszkq   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-7ff6c65bd4-qpv5d   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-7ff6c65bd4-vh6dx   Running   aks-nodepool1-77120516-vmss000002
Our Node affinity rules were defined to prefer Nodes that have node-type set to either fast or superfast, and indeed the Pods were scheduled for Node 1 and Node 2 only.
Now we will do an experiment to demonstrate how the soft part of Node affinity together with the hard part of Node anti-affinity work. We will relabel the Nodes as described in the introduction, redeploy the Deployment, and observe what happens. Please follow these steps:
First, relabel the Nodes so that Node 0 remains slow and Node 1 and Node 2 become extremelyslow:
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000000 node-type=slow
node/aks-nodepool1-77120516-vmss000000 not labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000001 node-type=extremelyslow
node/aks-nodepool1-77120516-vmss000001 labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000002 node-type=extremelyslow
node/aks-nodepool1-77120516-vmss000002 labeled
Next, restart the Deployment to recreate and reschedule the Pods:
$ kubectl rollout restart deploy nginx-deployment-example
deployment.apps/nginx-deployment-example restarted
Now inspect the Pod assignments again:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6c4fdd447d-4mjfm   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-qgqmc   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-qhrtf   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-tnvpm   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-ttfnk   Running   aks-nodepool1-77120516-vmss000000
The output shows that, as expected, all Pods have been scheduled to Node 0, which is labeled with node-type=slow. We allow such Nodes if there is nothing better, and in this case, Node 1 and Node 2 have the label node-type=extremelyslow, which is prohibited by the hard Node anti-affinity rule.
Tip
To achieve even higher granularity and control of Pod scheduling, you can use Pod topology spread constraints. More details are available in the official documentation: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/.
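As an illustrative sketch only (not part of our demonstration), a constraint spreading our nginx Pods evenly across availability zones could look like the following fragment of a Pod specification:

```yaml
# Sketch: spread Pods across zones with at most 1 Pod of skew.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft constraint; DoNotSchedule makes it hard
    labelSelector:
      matchLabels:
        app: nginx
```

The topologyKey refers to a Node label, so this mechanism composes naturally with the labeling we have used throughout this section.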
Congratulations, you have successfully configured Node affinity for our Deployment Pods! We will now explore another way of scheduling Pods – taints and tolerations.
Using the Node and inter-Pod affinity mechanisms for scheduling Pods is very powerful, but sometimes you need a simpler way of specifying which Nodes should repel Pods. Kubernetes offers a slightly older and simpler feature for this purpose – taints and tolerations. You apply a taint to a given Node (describing some kind of limitation), and a Pod must define a matching toleration to be schedulable on the tainted Node. Note that a Pod having a toleration does not mean the taint is required on the Node. The dictionary definition of taint is "a trace of a bad or undesirable substance or quality," and this reflects the idea pretty well – all Pods will avoid a Node that has a taint set on it, unless we instruct them to tolerate that specific taint.
Tip
If you look closely at how taints and tolerations are described, you can see that you can achieve similar results with Node labels and Node hard and soft affinity rules with the NotIn operator. There is one catch – you can define taints with a NoExecute effect which will result in the termination of the Pod if it cannot tolerate it. You cannot get similar results with affinity rules unless you restart the Pod manually.
Taints for Nodes have the following structure: <key>=<value>:<effect>. The key and value pair identifies the taint and can be used for more granular tolerations, for example tolerating all taints with a given key and any value. This is similar to labels, but please bear in mind that taints are separate properties, and defining a taint does not affect Node labels. In our example demonstration, we will use our own taint with a machine-check-exception key and a memory value. This is, of course, a theoretical example where we want to indicate that there is a hardware issue with memory on the host, but you could also have a taint with the same key and instead a cpu or disk value. In general, your taints should semantically label the type of issue that the Node is experiencing. There is nothing preventing you from using any keys and values for creating taints, but if they make semantic sense, it is much easier to manage them and define tolerations.
The taint can have one of the following effects:
- NoSchedule: no new Pods will be scheduled on the tainted Node unless they tolerate the taint. Pods already running on the Node are not affected.
- PreferNoSchedule: a soft version of NoSchedule – kube-scheduler will try to avoid placing new Pods on the tainted Node but may still do so if there is no other option.
- NoExecute: no new Pods will be scheduled on the Node, and Pods already running there that do not tolerate the taint will be evicted.
Kubernetes manages quite a few NoExecute taints automatically by monitoring the Node hosts. The following taints are built in and managed by the Node controller or the kubelet:
- node.kubernetes.io/not-ready: the Node's Ready condition is False
- node.kubernetes.io/unreachable: the Node is unreachable from the Node controller
- node.kubernetes.io/memory-pressure: the Node is running out of memory
- node.kubernetes.io/disk-pressure: the Node is running out of disk space
- node.kubernetes.io/pid-pressure: the Node is running out of process IDs
- node.kubernetes.io/network-unavailable: the Node's network is not configured
- node.kubernetes.io/unschedulable: the Node has been cordoned
To add a taint on a Node, you use the kubectl taint node command in the following way:
$ kubectl taint node <nodeName> <key>=<value>:<effect>
So, for example, if we want to use key machine-check-exception and a memory value with a NoExecute effect for Node 1, we will use the following command:
$ kubectl taint node aks-nodepool1-77120516-vmss000001 machine-check-exception=memory:NoExecute
To remove the same taint, you need to use the following command (bear in mind the - character at the end of the taint definition):
$ kubectl taint node aks-nodepool1-77120516-vmss000001 machine-check-exception=memory:NoExecute-
You can also remove all taints with a specified key and effect, regardless of their value:
$ kubectl taint node aks-nodepool1-77120516-vmss000001 machine-check-exception:NoExecute-
To counteract the effect of the taint on a Node for specific Pods, you can define tolerations in their specification. In other words, you can use tolerations to ignore taints and still schedule the Pods to such Nodes. If a Node has multiple taints applied, the Pod must tolerate all of its taints. Tolerations are defined under .spec.tolerations in the Pod specification and have the following structure:
tolerations:
- key: <key>
  operator: <operatorType>
  value: <value>
  effect: <effect>
The operator can be either Equal or Exists. Equal means that both key and value of taint must match exactly, whereas Exists means that just key must match and value is not considered. In our example, if we want to ignore the taint, the toleration will need to look like this:
tolerations:
- key: machine-check-exception
  operator: Equal
  value: memory
  effect: NoExecute
You can define multiple tolerations for a Pod.
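Similarly, if we wanted to tolerate the machine-check-exception key with any value (memory, cpu, or disk), a sketch using the Exists operator would simply omit the value:

```yaml
tolerations:
- key: machine-check-exception
  operator: Exists     # matches the key regardless of its value
  effect: NoExecute
```

This is useful when the taint values on your Nodes vary but the scheduling decision for the Pod is the same for all of them.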
In the case of NoExecute tolerations, it is possible to define an additional field called tolerationSeconds, which specifies how long the Pod will tolerate the taint before it gets evicted. So, this is a way of having partial toleration of a taint with a timeout. Please note that if you use NoExecute taints, you usually also need to add a NoSchedule taint. In this way, you can prevent eviction loops from happening when the Pod has a NoExecute toleration with tolerationSeconds set. This is because, for the specified number of seconds, the NoExecute taint has no effect on the Pod at all – which means it also does not prevent the Pod from being scheduled onto the tainted Node.
Important Note
When Pods are created in the cluster, Kubernetes automatically adds two Exists tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds set to 300.
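In practice, if you inspect a running Pod with kubectl get pod <name> -o yaml, you should see these automatic tolerations in its specification, similar to the following:

```yaml
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300   # evict after 5 minutes of the Node being not ready
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```

This default gives workloads a 5-minute grace period on a failing Node before they are evicted and rescheduled elsewhere.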
We will now put this knowledge into practice with a few demonstrations. Please follow the next steps:
First, apply a NoExecute taint to Node 0, where all our Deployment Pods currently run:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoExecute
node/aks-nodepool1-77120516-vmss000000 tainted
After a short while, check the status of the Pods:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6c4fdd447d-c42z2   Pending   <none>
nginx-deployment-example-6c4fdd447d-dstbl   Pending   <none>
nginx-deployment-example-6c4fdd447d-ktfzh   Pending   <none>
nginx-deployment-example-6c4fdd447d-ptcwc   Pending   <none>
nginx-deployment-example-6c4fdd447d-wdmb9   Pending   <none>
All Deployment Pods are now in the Pending state because kube-scheduler is unable to find a Node that can run them.
Now, modify the nginx-deployment.yaml manifest so that the Pod template no longer has the affinity rules and instead defines a toleration for our taint, with tolerationSeconds set to 60:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      tolerations:
      - key: machine-check-exception
        operator: Equal
        value: memory
        effect: NoExecute
        tolerationSeconds: 60
      ...
When this manifest is applied to the cluster, the old Node affinity rules which prevented scheduling to Node 1 and Node 2 will be gone. The Pods will be able to schedule on Node 1 and Node 2, but Node 0 has taint machine-check-exception=memory:NoExecute. So, the Pods should not be scheduled to Node 0, as NoExecute implies NoSchedule, right? Let's check that.
Apply the manifest and observe the Pods closely over the next few minutes:
$ kubectl get pods -o wide
NAME                                        ...   AGE   IP             NODE
nginx-deployment-example-6b774d7f6c-95ttq   ...   14s   10.244.1.230   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6b774d7f6c-hthwj   ...   16m   10.244.0.110   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-lskr7   ...   14s   10.244.1.231   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6b774d7f6c-q94kw   ...   16m   10.244.2.19    aks-nodepool1-77120516-vmss000002
nginx-deployment-example-6b774d7f6c-wszfn   ...   16m   10.244.0.109   aks-nodepool1-77120516-vmss000001
This result may be a bit surprising. As you can see, we got two Pods scheduled on Node 1 and one Pod on Node 2, but at the same time, Node 0 has received two Pods, and they are stuck in an eviction loop, being recreated and evicted every 60 seconds! The explanation for this is that tolerationSeconds for the NoExecute taint means the whole taint is ignored for 60 seconds, so kube-scheduler can still schedule Pods on Node 0, even though they will be evicted later.
To fix this, add an accompanying NoSchedule taint to Node 0:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoSchedule
node/aks-nodepool1-77120516-vmss000000 tainted
After the evicted Pods have been recreated, check the assignments again:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6b774d7f6c-hthwj   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-jfvqn   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-q94kw   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-6b774d7f6c-wszfn   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-z8jx2   Running   aks-nodepool1-77120516-vmss000002
In the output you can see that the Pods are now distributed between Node 1 and Node 2 – exactly as we wanted.
Next, remove all taints from Node 0 (removing by key alone removes the taints for all effects) and restart the Deployment:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception-
node/aks-nodepool1-77120516-vmss000000 untainted
$ kubectl rollout restart deploy nginx-deployment-example
deployment.apps/nginx-deployment-example restarted
Check the Pod distribution:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-56f4d4d96d-nf82h   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-56f4d4d96d-v8m9c   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-56f4d4d96d-vzqn4   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-56f4d4d96d-wpv78   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-56f4d4d96d-x7x92   Running   aks-nodepool1-77120516-vmss000001
The Pods are again distributed evenly between all three Nodes.
Finally, let's verify that applying both taints at once causes a clean eviction without any scheduling loops. Add both taints back to Node 0:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoSchedule
node/aks-nodepool1-77120516-vmss000000 tainted
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoExecute
node/aks-nodepool1-77120516-vmss000000 tainted
After about 60 seconds, check the Pods:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-56f4d4d96d-44zvt   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-56f4d4d96d-9rg2p   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-56f4d4d96d-nf82h   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-56f4d4d96d-wpv78   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-56f4d4d96d-x7x92   Running   aks-nodepool1-77120516-vmss000001
As we expected, the Pods have been evicted after 60 seconds and there were no eviction-schedule loops.
This has demonstrated a more advanced use case for taints which you cannot easily substitute with Node affinity rules. In the next section, we will give a short overview of kube-scheduler scheduling policies.
kube-scheduler decides which Node a given Pod should be scheduled on in two phases: filtering and scoring. To quickly recap, filtering is the first phase, when kube-scheduler finds the set of Nodes that are capable of running the Pod – for example, only Nodes whose taints the Pod tolerates pass this phase. In the second phase, scoring, the filtered Nodes are ranked using a scoring system to find the most suitable Node for the Pod.
The way the default kube-scheduler executes these two phases is defined by the scheduling policy. This policy is configurable and can be passed to the kube-scheduler process using the additional arguments --policy-config-file <filename> or --policy-configmap <configMap>.
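A minimal policy file sketch could look like the following (the file is traditionally JSON, which is also valid YAML; the chosen predicates, priorities, and weights here are purely illustrative, drawn from the documented policy reference):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsResources" },
    { "name": "PodToleratesNodeTaints" },
    { "name": "MatchNodeSelector" }
  ],
  "priorities": [
    { "name": "LeastRequestedPriority", "weight": 1 },
    { "name": "NodeAffinityPriority", "weight": 2 }
  ]
}
```

The weights let you bias the scoring phase – here, Node affinity preferences would count twice as much as resource-based spreading.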
Important note
In managed Kubernetes clusters, such as the managed Azure Kubernetes Service, you will not be able to change the scheduling policy of kube-scheduler, as you do not have access to the Kubernetes master Nodes.
There are two configuration fields that are the most important in a scheduling policy:
- predicates: the set of rules used in the filtering phase to determine which Nodes are capable of running the Pod
- priorities: the set of scoring functions (with weights) used in the scoring phase to rank the filtered Nodes
The full list of currently supported predicates and priorities is available in the official documentation: https://kubernetes.io/docs/reference/scheduling/policies/. We will give an overview of a few of the most interesting ones, which show how flexible the default kube-scheduler is. Some of the selected predicates are shown in the following list:
- PodFitsResources: checks whether the Node has enough free resources (such as CPU and memory) to satisfy the resource requests of the Pod
- MatchNodeSelector: checks whether the Pod's nodeSelector and Node affinity rules match the labels of the Node
- PodToleratesNodeTaints: checks whether the Pod's tolerations can tolerate the taints of the Node
- NoDiskConflict: checks whether the volumes requested by the Pod conflict with volumes already in use on the Node
Some of the interesting available priorities are as follows:
- SelectorSpreadPriority: spreads Pods that belong to the same Service, StatefulSet, or ReplicaSet across different Nodes
- LeastRequestedPriority: favors Nodes with the lowest ratio of requested resources, which spreads Pods onto less utilized Nodes
- NodeAffinityPriority: scores Nodes according to the preferredDuringSchedulingIgnoredDuringExecution Node affinity rules of the Pod
- TaintTolerationPriority: gives lower scores to Nodes with a higher number of PreferNoSchedule taints that the Pod does not tolerate
- ImageLocalityPriority: favors Nodes that already have the container images required by the Pod cached locally
The preceding examples are just a subset of the available predicates and priorities, but this already gives an overview of how many complex use cases and scenarios are supported out of the box in kube-scheduler.
This chapter has given an overview of advanced techniques for Pod scheduling in Kubernetes. First, we recapped the theory behind the kube-scheduler implementation and explained the process of scheduling Pods. Next, we introduced the concept of Node affinity in Pod scheduling. You learned about the basic scheduling methods that use Node names and Node selectors, and building on that, we explained how the more advanced Node affinity works. We also explained how you can use the affinity concept to achieve anti-affinity, and what inter-Pod affinity/anti-affinity is. After that, we discussed taints for Nodes and the tolerations specified by Pods. You learned about the different effects of taints and put this knowledge into practice in an advanced use case involving NoExecute and NoSchedule taints on a Node. Lastly, we discussed the theory behind the scheduling policies that can be used to configure the default kube-scheduler.
In the next chapter, we are going to discuss autoscaling of Pods and Nodes in Kubernetes – this will be a topic that will show how flexibly Kubernetes can run workloads in cloud environments.
For more information regarding Pod scheduling in Kubernetes, please refer to the following PacktPub books:
You can also refer to official documents: