At the beginning of the book, in Chapter 2, Kubernetes Architecture – From Docker Images to Running Pods, we explained the principles behind the Kubernetes scheduler (kube-scheduler) control plane component and its crucial role in the cluster. In short, its responsibility is to schedule container workloads (Kubernetes Pods) and assign them to healthy worker Nodes that fulfill the criteria required for running a particular workload.
This chapter will cover how you can control the criteria for scheduling Pods in the cluster. We will especially dive deeper into Node affinity, taints, and tolerations for Pods. We will also take a closer look at scheduling policies, which give kube-scheduler flexibility in how it prioritizes Pod workloads. You will find all of these concepts important in running production clusters at cloud scale.
In this chapter, we will cover the following topics:
- Refreshing how the Kubernetes scheduler works
- Managing Node affinity
- Using Node taints and tolerations
- Understanding the scheduling policies of kube-scheduler
For this chapter, you will need the following:
- A deployed Kubernetes cluster (either local or cloud-based)
- The Kubernetes command-line tool, kubectl, installed on your local machine and configured to manage your cluster
Basic Kubernetes cluster deployment (local and cloud-based) and kubectl installation have been covered in Chapter 3, Installing Your First Kubernetes Cluster.
The following previous chapters can give you an overview of how to deploy a fully functional Kubernetes cluster on different cloud platforms:
You can download the latest code samples for this chapter from the official GitHub repository: https://github.com/PacktPublishing/The-Kubernetes-Bible/tree/master/Chapter19.
In Kubernetes clusters, kube-scheduler is a component of the control plane that runs on Master Nodes. The main responsibility of this component is scheduling container workloads (Pods) and assigning them to healthy worker Nodes that fulfill the criteria required for running a particular workload. To recap, a Pod is a group of one or more containers with a shared network and storage and is the smallest deployment unit in the Kubernetes system. You usually use different Kubernetes controllers, such as Deployment objects and StatefulSet objects, to manage your Pods, but it is kube-scheduler that eventually assigns the created Pods to particular Nodes in the cluster.
Important note
For managed Kubernetes clusters in the cloud, such as the managed Azure Kubernetes Service or the Amazon Elastic Kubernetes Service, you normally do not have access to the Master Nodes, as they are managed by the cloud service provider for you. This means you will not have access to kube-scheduler itself, and usually, you cannot control its configuration, such as scheduling policies. But you can control all parameters for Pods that influence their scheduling.
Kube-scheduler queries the Kubernetes API Server (kube-apiserver) at a regular interval in order to list the Pods that have not been scheduled. At creation, Pods are marked as not scheduled – this means no worker Node was elected to run them. A Pod that is not scheduled will be registered in the etcd cluster state but without any worker Node assigned to it, and thus, no running kubelet will be aware of this Pod. Ultimately, no container described in the Pod specification will run at this point.
Internally, the Pod object, as it is stored in etcd, has a property called nodeName. As the name suggests, this property should contain the name of the worker Node that will host the Pod. When this property is set, we say the Pod is in a scheduled state; otherwise, the Pod is in a pending state.
We need a way to fill in this value, and that is the role of kube-scheduler. For this, kube-scheduler continuously polls the kube-apiserver at a regular interval, looking for Pod resources with an empty nodeName property. Once it finds such Pods, it executes an algorithm to elect a worker Node and updates the nodeName property in the Pod object by issuing an HTTP request to the kube-apiserver. When electing a worker Node, kube-scheduler takes into account its internal scheduling policies and the criteria that you defined for the Pods. Finally, the kubelet that is responsible for running Pods on the elected worker Node will notice that there is a new Pod in the scheduled state for its Node and will attempt to start the Pod. These principles are visualized in the following diagram:
The scheduling process for a Pod is performed in two phases:
- Filtering: kube-scheduler determines the set of Nodes that are capable of running the given Pod, for example, by checking whether the Pod's resource requests and tolerations can be satisfied.
- Scoring: kube-scheduler ranks the filtered Nodes by assigning a score to each of them and elects the Node with the highest score to run the Pod.
The kube-scheduler will consider criteria and configuration values you can optionally pass in the Pod specification. By using these configurations, you can control precisely how the kube-scheduler will elect a worker Node.
Important note
The decisions of kube-scheduler are valid precisely at the point in time of scheduling the Pod. Once the Pod is scheduled and running, kube-scheduler will not perform any rescheduling operations for it (and the Pod may run for days or even months). So even if the Pod no longer matches the Node according to your rules, it will keep running there. Rescheduling happens only when the Pod is terminated and a new Pod needs to be scheduled.
In the next sections, we will discuss the following configurations to control the scheduling of Pods:
Let's first take a look at Node affinity, together with Node name and Node selector.
To better understand how Node affinity works in Kubernetes, we first need to take a look at the most basic scheduling options, which are using Node name and Node selector for Pods.
As we mentioned before, each Pod object has a nodeName field which is usually controlled by the kube-scheduler. Nevertheless, it is possible to set this property directly in the YAML manifest when you create a Pod or create a controller that uses a Pod template. This is the simplest form of statically scheduling Pods on a given Node and is generally not recommended – it is not flexible and does not scale at all. The names of Nodes can change over time and you risk running out of resources on the Node.
Tip
You may find setting nodeName explicitly useful in debugging scenarios when you want to run a Pod on a specific Node.
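For such a debugging scenario, a bare Pod manifest pinned to a specific Node could look like the following sketch (the Node name here is just an example from our cluster and is an assumption – substitute one of your own Nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-debug
spec:
  # Setting nodeName directly bypasses kube-scheduler entirely
  nodeName: aks-nodepool1-77120516-vmss000000
  containers:
  - name: nginx
    image: nginx:1.17
```

Because the scheduler is bypassed, no filtering or scoring takes place – if the Node lacks resources, the Pod will simply fail to start.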
We are going to demonstrate all scheduling principles on an example Deployment object that we introduced in Chapter 11, Deployment – Deploying Stateless Applications. This is a simple Deployment that manages five Pod replicas of an nginx webserver running in a container. Create the following YAML manifest named nginx-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
      environment: test
  template:
    metadata:
      labels:
        app: nginx
        environment: test
    spec:
      containers:
      - name: nginx
        image: nginx:1.17
        ports:
        - containerPort: 80
At this point, the Pod template in .spec.template.spec does not contain any configuration that affects the scheduling of the Pod replicas. Before we apply the manifest to the cluster, we need to know what Nodes we have in the cluster so that we can understand how the Pods are scheduled onto them and how we can influence that scheduling. You can get the list of Nodes using the kubectl get nodes command:
$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1-77120516-vmss000000   Ready    agent   1d    v1.18.14
aks-nodepool1-77120516-vmss000001   Ready    agent   1d    v1.18.14
aks-nodepool1-77120516-vmss000002   Ready    agent   1d    v1.18.14
In our example, we are running a three-Node cluster. For simplicity, we will refer to aks-nodepool1-77120516-vmss000000 as Node 0, aks-nodepool1-77120516-vmss000001 as Node 1, and aks-nodepool1-77120516-vmss000002 as Node 2.
Now, let's apply the nginx-deployment.yaml YAML manifest to the cluster:
$ kubectl apply -f ./nginx-deployment.yaml
deployment.apps/nginx-deployment-example created
The Deployment object will create five Pod replicas. You can get their statuses, together with the Node names that they were scheduled for, using the following command:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-5549875c78-nndb4   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-5549875c78-ps7pd   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-5549875c78-s824f   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-5549875c78-xfbkj   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-5549875c78-zg2w7   Running   aks-nodepool1-77120516-vmss000000
As you can see, by default the Pods have been distributed uniformly – Node 0 has received two Pods, Node 1 one Pod, and Node 2 two Pods. This is a result of the default scheduling policies enabled in the kube-scheduler for filtering and scoring.
Tip
If you are running a non-managed Kubernetes cluster, you can inspect the logs for the kube-scheduler Pod using the kubectl logs command, or even directly on the master Nodes in /var/log/kube-scheduler.log. This may also require increasing the log verbosity for the kube-scheduler process. You can read more at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/.
We will now forcefully assign all Pods in the Deployment to Node 0 in the cluster using the nodeName field in the Pod template. Change the nginx-deployment.yaml YAML manifest so that it has this property set with the correct Node name for your cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      nodeName: aks-nodepool1-77120516-vmss000000
      ...
Apply the manifest to the cluster using the kubectl apply -f ./nginx-deployment.yaml command and inspect the Pod status and Node assignment again:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6977595df5-95sfh   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-cxgqb   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-h5wwk   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-pww9g   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6977595df5-q5xxs   Running   aks-nodepool1-77120516-vmss000000
As expected, all five Pods are now running on Node 0. These are all new Pods – when you change the Pod template in the Deployment specification, it internally causes a rollout using a new ReplicaSet object, while the old ReplicaSet object is scaled down, as explained in Chapter 11, Deployment – Deploying Stateless Applications.
Important note
In this way, we have actually bypassed kube-scheduler. If you inspect events for one of the Pods using the kubectl describe pod command, you will see that it lacks any events with Scheduled as a reason.
Next, we are going to take a look at another basic method of scheduling Pods, which is the Node selector.
The Pod specification has a special field, .spec.nodeSelector, that gives you the ability to schedule your Pod only on Nodes that have certain label values. This concept is similar to the label selectors that you know from Deployments or StatefulSets, but the difference is that it allows only simple equality-based comparisons for labels – you cannot do advanced set-based logic.
A very common use case for scheduling Pods using nodeSelector is managing Pods in hybrid Windows/Linux clusters. Every Kubernetes Node comes by default with a set of labels, which include the following:
- kubernetes.io/arch: the processor architecture of the Node, for example, amd64 or arm64
- kubernetes.io/os: the operating system running on the Node, with a value of linux or windows
- kubernetes.io/hostname: the hostname of the Node
If you inspect the labels for one of the Nodes, you will see that there are plenty of them – in our case some of them are specific to Azure Kubernetes Service (AKS) clusters only:
$ kubectl describe node aks-nodepool1-77120516-vmss000000
...
Labels:     agentpool=nodepool1
            beta.kubernetes.io/arch=amd64
            beta.kubernetes.io/instance-type=Standard_DS2_v2
            beta.kubernetes.io/os=linux
            failure-domain.beta.kubernetes.io/region=eastus
            failure-domain.beta.kubernetes.io/zone=0
            kubernetes.azure.com/cluster=MC_k8sforbeginners-rg_k8sforbeginners-aks_eastus
            kubernetes.azure.com/mode=system
            kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2-2021.02.17
            kubernetes.azure.com/role=agent
            kubernetes.io/arch=amd64
            kubernetes.io/hostname=aks-nodepool1-77120516-vmss000000
            kubernetes.io/os=linux
            kubernetes.io/role=agent
            node-role.kubernetes.io/agent=
            node.kubernetes.io/instance-type=Standard_DS2_v2
            storageprofile=managed
            storagetier=Premium_LRS
            topology.kubernetes.io/region=eastus
            topology.kubernetes.io/zone=0
...
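Putting the default OS label to use, a hedged sketch of the hybrid Windows/Linux use case mentioned earlier could look like the following – a Pod template fragment that restricts scheduling to Linux Nodes:

```yaml
# Sketch: restricting a Pod to Linux Nodes using the default OS label.
apiVersion: v1
kind: Pod
metadata:
  name: linux-only-pod
spec:
  nodeSelector:
    kubernetes.io/os: linux   # use "windows" for Windows container workloads
  containers:
  - name: nginx
    image: nginx:1.17
```

This is what prevents, for example, a Linux container image from being scheduled onto a Windows worker Node in a mixed cluster.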
Of course, you can define your own labels for the Nodes and use them to control scheduling. Please note that in general you should use semantic labeling for your resources in Kubernetes, rather than give them special labels just for the purpose of scheduling. Let's demonstrate how to do that by following these steps:
First, label Node 1 and Node 2 with node-type=superfast:
$ kubectl label nodes aks-nodepool1-77120516-vmss000001 node-type=superfast
node/aks-nodepool1-77120516-vmss000001 labeled
$ kubectl label nodes aks-nodepool1-77120516-vmss000002 node-type=superfast
node/aks-nodepool1-77120516-vmss000002 labeled
Next, edit the nginx-deployment.yaml manifest so that the Pod template requires this label using nodeSelector:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      nodeSelector:
        node-type: superfast
      ...
Apply the manifest using the kubectl apply -f ./nginx-deployment.yaml command and inspect the Pod assignments:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-8485bc9569-2pm5h   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-79gn9   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-8485bc9569-df6x8   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-fd4gv   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-8485bc9569-tlxgl   Running   aks-nodepool1-77120516-vmss000002
As you can see, Node 1 has been assigned with two Pods and Node 2 with three Pods. The Pods have been distributed among Nodes that have the node-type=superfast label.
Now, let's see what happens when the nodeSelector does not match any Node. Change the manifest to require a node-type=slow label, which currently no Node in our cluster has:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      nodeSelector:
        node-type: slow
      ...
Apply the manifest again and check the Pods:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-54dbf4699f-jdx42   Pending   <none>
nginx-deployment-example-54dbf4699f-sk2jd   Pending   <none>
nginx-deployment-example-54dbf4699f-xjdp2   Pending   <none>
nginx-deployment-example-8485bc9569-2pm5h   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-df6x8   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-8485bc9569-fd4gv   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-8485bc9569-tlxgl   Running   aks-nodepool1-77120516-vmss000002
The reason why three new Pods are Pending and four old Pods are still Running is the default configuration of rolling updates in the Deployment object. By default, maxSurge is set to 25% of the Pod replicas (with the absolute number rounded up), so in our case two Pods are allowed to be created above the desired count of five – in total, we can have up to seven Pods. At the same time, maxUnavailable is also 25% of the Pod replicas (but with the absolute number rounded down), so in our case one Pod out of five can be unavailable – in other words, four Pods must be Running. And because the new Pending Pods cannot get a Node assigned during scheduling, the Deployment is stuck waiting and not progressing. Normally, in this case, you need to either perform a rollback to the previous version of the Deployment or change nodeSelector to one that properly matches existing Nodes. Alternatively, you can add a new Node with matching labels, or add the missing labels to the existing Nodes, without performing a rollback.
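The defaults described above can be made explicit (or tuned) in the Deployment's rollout strategy. A sketch of the relevant manifest fragment, with the default values spelled out:

```yaml
# Sketch: the rolling update parameters discussed above, set explicitly.
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # 25% of 5, rounded up = 2 extra Pods allowed
      maxUnavailable: 25%    # 25% of 5, rounded down = 1 Pod may be unavailable
```

Setting maxSurge: 0 together with a higher maxUnavailable would instead replace Pods in place without ever exceeding the desired replica count.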
We will now continue the topic of scheduling Pods with the first of more advanced techniques: Node affinity.
The concept of Node affinity expands the nodeSelector approach and provides a richer language for defining which Nodes are preferred or avoided for your Pod. In everyday life, the word affinity describes a natural liking for and understanding of someone or something, and this best describes the purpose of Node affinity for Pods. That is, you can control which Nodes your Pod will be attracted to or repelled by.
With Node affinity, represented in .spec.affinity.nodeAffinity for the Pod, you get the following enhancements over the simple nodeSelector:
- A richer language for defining rules: besides equality checks, you can use operators such as In, NotIn, Exists, DoesNotExist, Gt, and Lt in matchExpressions.
- Hard rules, defined in requiredDuringSchedulingIgnoredDuringExecution, which must be fulfilled for a Pod to be scheduled on a Node.
- Soft (preference) rules, defined in preferredDuringSchedulingIgnoredDuringExecution, which kube-scheduler will try to fulfill but which are not a strict requirement; each rule has a weight that contributes to the Node's score.
Tip
Even though Node anti-affinity is not provided as a separate field in the spec (as it is in the case of inter-Pod anti-affinity), you can still achieve similar results by using the NotIn and DoesNotExist operators. In this way, you can make Pods be repelled from Nodes with specific labels, also in a soft way.
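For example, a soft Node anti-affinity that repels Pods from Nodes labeled node-type=slow (a hypothetical label used only for illustration) might be sketched like this:

```yaml
# Sketch: soft anti-affinity achieved with the NotIn operator.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-type
          operator: NotIn      # prefer Nodes WITHOUT this label value
          values:
          - slow
```

Because this is a preferred rule, the Pod can still land on a node-type=slow Node if no better Node is available.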
The use cases and scenarios for defining Node affinity and inter-Pod affinity/anti-affinity rules are practically unlimited. It is possible to express all kinds of requirements in this way, provided that you have enough labeling on the Nodes. For example, you can model requirements like scheduling the Pod only on a Windows Node with an Intel CPU and premium storage in the West Europe region that is currently not running Pods for MySQL, or trying not to schedule the Pod in availability Zone 1 while still allowing it if there is no other possibility.
To demonstrate Node affinity, we will try to model the following requirements for our Deployment: "Try to schedule the Pod only on Nodes that have a node-type label with a fast or superfast value, but if this is not possible, use any Node except ones that have a node-type label with an extremelyslow value." For this, we need to use the following:
- A soft affinity rule in preferredDuringSchedulingIgnoredDuringExecution with the In operator for the fast and superfast values
- A hard anti-affinity rule in requiredDuringSchedulingIgnoredDuringExecution with the NotIn operator for the extremelyslow value
In our cluster, we are going to first have the following labeling for the Nodes:
- Node 0: node-type=slow
- Node 1: node-type=fast
- Node 2: node-type=superfast
As you can see, according to our requirements the Deployment Pods should be scheduled on Node 1 and Node 2, unless there is something preventing them from being allocated there, like a lack of CPU or memory resources. In that case, Node 0 would also be allowed as we use the soft affinity rule.
Next, we will relabel the Nodes in the following way:
- Node 0: node-type=slow (unchanged)
- Node 1: node-type=extremelyslow
- Node 2: node-type=extremelyslow
Subsequently, we will need to redeploy our Deployment (for example, scale it down to zero and up to the original replica count, or use the kubectl rollout restart command) to reschedule the Pods again. After that, looking at our requirements, kube-scheduler should assign all Pods to Node 0 (because it is still allowed by the soft rule) but avoid at all costs Node 1 and Node 2. If by any chance Node 0 has no resources to run the Pod, then the Pods would be stuck in the Pending state.
Tip
To solve the issue of rescheduling already running Pods (in other words, to make kube-scheduler consider them again), there is an incubating Kubernetes project named Descheduler. You can find out more here: https://github.com/kubernetes-sigs/descheduler.
To do the demonstration, please follow these steps:
First, relabel the Nodes so that Node 0 is slow, Node 1 is fast, and Node 2 is superfast (--overwrite is required for labels that already exist):
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000000 node-type=slow
node/aks-nodepool1-77120516-vmss000000 labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000001 node-type=fast
node/aks-nodepool1-77120516-vmss000001 labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000002 node-type=superfast
node/aks-nodepool1-77120516-vmss000002 not labeled # Note that this label was already present with this value
Next, modify the nginx-deployment.yaml manifest so that the Pod template defines the Node affinity rules (and no longer uses nodeSelector):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: NotIn
                values:
                - extremelyslow
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - fast
                - superfast
      ...
As you can see, we have used nodeAffinity (not podAffinity or podAntiAffinity) with preferredDuringSchedulingIgnoredDuringExecution set so that it has only one soft rule: node-type should have a fast value or a superfast value. This means that if there are no resources on such Nodes, they can still be scheduled on other Nodes. Additionally, we specify one hard anti-affinity rule in requiredDuringSchedulingIgnoredDuringExecution, which says that node-type must not be extremelyslow. You can find the full specification of Pod's .spec.affinity in the official documentation: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#affinity-v1-core.
Apply the manifest to the cluster and inspect the Pod assignments:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-7ff6c65bd4-8z7z5   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-7ff6c65bd4-ps9md   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-7ff6c65bd4-pszkq   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-7ff6c65bd4-qpv5d   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-7ff6c65bd4-vh6dx   Running   aks-nodepool1-77120516-vmss000002
Our Node affinity rules were defined to prefer Nodes that have node-type set to either fast or superfast, and indeed the Pods were scheduled for Node 1 and Node 2 only.
Now we will do an experiment to demonstrate how the soft part of Node affinity together with the hard part of Node anti-affinity work. We will relabel the Nodes as described in the introduction, redeploy the Deployment, and observe what happens. Please follow these steps:
First, relabel the Nodes so that Node 0 remains slow and Node 1 and Node 2 become extremelyslow:
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000000 node-type=slow
node/aks-nodepool1-77120516-vmss000000 not labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000001 node-type=extremelyslow
node/aks-nodepool1-77120516-vmss000001 labeled
$ kubectl label nodes --overwrite aks-nodepool1-77120516-vmss000002 node-type=extremelyslow
node/aks-nodepool1-77120516-vmss000002 labeled
Next, restart the Deployment to recreate and reschedule the Pods:
$ kubectl rollout restart deploy nginx-deployment-example
deployment.apps/nginx-deployment-example restarted
Now inspect the Pod assignments again:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6c4fdd447d-4mjfm   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-qgqmc   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-qhrtf   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-tnvpm   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6c4fdd447d-ttfnk   Running   aks-nodepool1-77120516-vmss000000
The output shows that, as expected, all Pods have been scheduled to Node 0, which is labeled with node-type=slow. We allow such Nodes if there is nothing better, and in this case, Node 1 and Node 2 have the label node-type=extremelyslow, which is prohibited by the hard Node anti-affinity rule.
Tip
To achieve even higher granularity and control of Pod scheduling, you can use Pod topology spread constraints. More details are available in the official documentation: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/.
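As an illustrative sketch only (not part of our demonstration), a constraint spreading our nginx Pods evenly across availability zones could look like the following fragment of a Pod specification:

```yaml
# Sketch: spread Pods across zones with at most 1 Pod of skew.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft constraint; DoNotSchedule makes it hard
    labelSelector:
      matchLabels:
        app: nginx
```

The topologyKey refers to a Node label, so this mechanism composes naturally with the labeling we have used throughout this section.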
Congratulations, you have successfully configured Node affinity for our Deployment Pods! We will now explore another way of scheduling Pods – taints and tolerations.
Using the Node and inter-Pod affinity mechanisms for scheduling Pods is very powerful, but sometimes you need a simpler way of specifying which Nodes should repel Pods. Kubernetes offers a slightly older and simpler feature for this purpose – taints and tolerations. You apply a taint to a given Node (describing some kind of limitation), and a Pod must define a matching toleration to be schedulable on the tainted Node. Note that a Pod having a toleration does not mean the taint is required on the Node. The dictionary definition of taint is "a trace of a bad or undesirable substance or quality," and this reflects the idea pretty well – all Pods will avoid a Node that has a taint set on it, unless we instruct them to tolerate that specific taint.
Tip
If you look closely at how taints and tolerations are described, you can see that you can achieve similar results with Node labels and Node hard and soft affinity rules with the NotIn operator. There is one catch – you can define taints with a NoExecute effect which will result in the termination of the Pod if it cannot tolerate it. You cannot get similar results with affinity rules unless you restart the Pod manually.
Taints for Nodes have the following structure: <key>=<value>:<effect>. The key and value pair identifies the taint and can be used for more granular tolerations, for example tolerating all taints with a given key and any value. This is similar to labels, but please bear in mind that taints are separate properties, and defining a taint does not affect Node labels. In our example demonstration, we will use our own taint with a machine-check-exception key and a memory value. This is, of course, a theoretical example where we want to indicate that there is a hardware issue with memory on the host, but you could also have a taint with the same key and instead a cpu or disk value. In general, your taints should semantically label the type of issue that the Node is experiencing. There is nothing preventing you from using any keys and values for creating taints, but if they make semantic sense, it is much easier to manage them and define tolerations.
The taint can have one of the following effects:
- NoSchedule: no new Pods will be scheduled on the tainted Node unless they tolerate the taint. Pods already running on the Node are not affected.
- PreferNoSchedule: a soft version of NoSchedule – kube-scheduler will try to avoid placing new Pods on the tainted Node but may still do so if there is no other option.
- NoExecute: no new Pods will be scheduled on the Node, and Pods already running there that do not tolerate the taint will be evicted.
Kubernetes manages quite a few NoExecute taints automatically by monitoring the Node hosts. The following taints are built in and managed by the Node controller or the kubelet:
- node.kubernetes.io/not-ready: the Node's Ready condition is False
- node.kubernetes.io/unreachable: the Node is unreachable from the Node controller
- node.kubernetes.io/memory-pressure: the Node is running out of memory
- node.kubernetes.io/disk-pressure: the Node is running out of disk space
- node.kubernetes.io/pid-pressure: the Node is running out of process IDs
- node.kubernetes.io/network-unavailable: the Node's network is not configured
- node.kubernetes.io/unschedulable: the Node has been cordoned
To add a taint on a Node, you use the kubectl taint node command in the following way:
$ kubectl taint node <nodeName> <key>=<value>:<effect>
So, for example, if we want to use key machine-check-exception and a memory value with a NoExecute effect for Node 1, we will use the following command:
$ kubectl taint node aks-nodepool1-77120516-vmss000001 machine-check-exception=memory:NoExecute
To remove the same taint, you need to use the following command (bear in mind the - character at the end of the taint definition):
$ kubectl taint node aks-nodepool1-77120516-vmss000001 machine-check-exception=memory:NoExecute-
You can also remove all taints with a specified key and effect, regardless of their value:
$ kubectl taint node aks-nodepool1-77120516-vmss000001 machine-check-exception:NoExecute-
To counteract the effect of the taint on a Node for specific Pods, you can define tolerations in their specification. In other words, you can use tolerations to ignore taints and still schedule the Pods to such Nodes. If a Node has multiple taints applied, the Pod must tolerate all of its taints. Tolerations are defined under .spec.tolerations in the Pod specification and have the following structure:
tolerations:
- key: <key>
  operator: <operatorType>
  value: <value>
  effect: <effect>
The operator can be either Equal or Exists. Equal means that both key and value of taint must match exactly, whereas Exists means that just key must match and value is not considered. In our example, if we want to ignore the taint, the toleration will need to look like this:
tolerations:
- key: machine-check-exception
  operator: Equal
  value: memory
  effect: NoExecute
You can define multiple tolerations for a Pod.
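Similarly, if we wanted to tolerate the machine-check-exception key with any value (memory, cpu, or disk), a sketch using the Exists operator would simply omit the value:

```yaml
tolerations:
- key: machine-check-exception
  operator: Exists     # matches the key regardless of its value
  effect: NoExecute
```

This is useful when the taint values on your Nodes vary but the scheduling decision for the Pod is the same for all of them.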
In the case of NoExecute tolerations, it is possible to define an additional field called tolerationSeconds, which specifies how long the Pod will tolerate the taint before it gets evicted. So, this is a way of having partial toleration of a taint with a timeout. Please note that if you use NoExecute taints, you usually also need to add a NoSchedule taint. In this way, you can prevent eviction loops from happening when the Pod has a NoExecute toleration with tolerationSeconds set. This is because, for the specified number of seconds, the NoExecute taint has no effect on the Pod at all – which means it also does not prevent the Pod from being scheduled onto the tainted Node.
Important Note
When Pods are created in the cluster, Kubernetes automatically adds two Exists tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds set to 300.
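In practice, if you inspect a running Pod with kubectl get pod <name> -o yaml, you should see these automatic tolerations in its specification, similar to the following:

```yaml
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300   # evict after 5 minutes of the Node being not ready
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```

This default gives workloads a 5-minute grace period on a failing Node before they are evicted and rescheduled elsewhere.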
We will now put this knowledge into practice with a few demonstrations. Please follow the next steps:
First, apply a NoExecute taint to Node 0, where all our Deployment Pods currently run:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoExecute
node/aks-nodepool1-77120516-vmss000000 tainted
After a short while, check the status of the Pods:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6c4fdd447d-c42z2   Pending   <none>
nginx-deployment-example-6c4fdd447d-dstbl   Pending   <none>
nginx-deployment-example-6c4fdd447d-ktfzh   Pending   <none>
nginx-deployment-example-6c4fdd447d-ptcwc   Pending   <none>
nginx-deployment-example-6c4fdd447d-wdmb9   Pending   <none>
All Deployment Pods are now in the Pending state because kube-scheduler is unable to find a Node that can run them.
Now, modify the nginx-deployment.yaml manifest so that the Pod template no longer has the affinity rules and instead defines a toleration for our taint, with tolerationSeconds set to 60:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  ...
  template:
    ...
    spec:
      tolerations:
      - key: machine-check-exception
        operator: Equal
        value: memory
        effect: NoExecute
        tolerationSeconds: 60
      ...
When this manifest is applied to the cluster, the old Node affinity rules which prevented scheduling to Node 1 and Node 2 will be gone. The Pods will be able to schedule on Node 1 and Node 2, but Node 0 has taint machine-check-exception=memory:NoExecute. So, the Pods should not be scheduled to Node 0, as NoExecute implies NoSchedule, right? Let's check that.
Apply the manifest and observe the Pods closely over the next few minutes:
$ kubectl get pods -o wide
NAME                                        ...   AGE   IP             NODE
nginx-deployment-example-6b774d7f6c-95ttq   ...   14s   10.244.1.230   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6b774d7f6c-hthwj   ...   16m   10.244.0.110   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-lskr7   ...   14s   10.244.1.231   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-6b774d7f6c-q94kw   ...   16m   10.244.2.19    aks-nodepool1-77120516-vmss000002
nginx-deployment-example-6b774d7f6c-wszfn   ...   16m   10.244.0.109   aks-nodepool1-77120516-vmss000001
This result may be a bit surprising. As you can see, we got two Pods scheduled on Node 1 and one Pod on Node 2, but at the same time, Node 0 has received two Pods, and they are stuck in an eviction loop, being recreated and evicted every 60 seconds! The explanation for this is that tolerationSeconds for the NoExecute taint means the whole taint is ignored for 60 seconds, so kube-scheduler can still schedule Pods on Node 0, even though they will be evicted later.
To fix this, add an accompanying NoSchedule taint to Node 0:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoSchedule
node/aks-nodepool1-77120516-vmss000000 tainted
After the evicted Pods have been recreated, check the assignments again:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-6b774d7f6c-hthwj   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-jfvqn   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-q94kw   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-6b774d7f6c-wszfn   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-6b774d7f6c-z8jx2   Running   aks-nodepool1-77120516-vmss000002
In the output you can see that the Pods are now distributed between Node 1 and Node 2 – exactly as we wanted.
Next, remove all taints from Node 0 (removing by key alone removes the taints for all effects) and restart the Deployment:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception-
node/aks-nodepool1-77120516-vmss000000 untainted
$ kubectl rollout restart deploy nginx-deployment-example
deployment.apps/nginx-deployment-example restarted
Check the Pod distribution:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-56f4d4d96d-nf82h   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-56f4d4d96d-v8m9c   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-56f4d4d96d-vzqn4   Running   aks-nodepool1-77120516-vmss000000
nginx-deployment-example-56f4d4d96d-wpv78   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-56f4d4d96d-x7x92   Running   aks-nodepool1-77120516-vmss000001
The Pods are again distributed evenly between all three Nodes.
Finally, let's verify that applying both taints at once causes a clean eviction without any scheduling loops. Add both taints back to Node 0:
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoSchedule
node/aks-nodepool1-77120516-vmss000000 tainted
$ kubectl taint node aks-nodepool1-77120516-vmss000000 machine-check-exception=memory:NoExecute
node/aks-nodepool1-77120516-vmss000000 tainted
After about 60 seconds, check the Pods:
$ kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
NAME                                        STATUS    NODE
nginx-deployment-example-56f4d4d96d-44zvt   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-56f4d4d96d-9rg2p   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-56f4d4d96d-nf82h   Running   aks-nodepool1-77120516-vmss000002
nginx-deployment-example-56f4d4d96d-wpv78   Running   aks-nodepool1-77120516-vmss000001
nginx-deployment-example-56f4d4d96d-x7x92   Running   aks-nodepool1-77120516-vmss000001
As we expected, the Pods have been evicted after 60 seconds and there were no eviction-schedule loops.
This has demonstrated a more advanced use case for taints which you cannot easily substitute with Node affinity rules. In the next section, we will give a short overview of kube-scheduler scheduling policies.
kube-scheduler decides which Node a given Pod should be scheduled on in two phases: filtering and scoring. To quickly recap, filtering is the first phase, when kube-scheduler finds the set of Nodes that are capable of running the Pod – for example, only Nodes whose taints the Pod tolerates pass this phase. In the second phase, scoring, the filtered Nodes are ranked using a scoring system to find the most suitable Node for the Pod.
The way the default kube-scheduler executes these two phases is defined by the scheduling policy. This policy is configurable and can be passed to the kube-scheduler process using the additional arguments --policy-config-file <filename> or --policy-configmap <configMap>.
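A minimal policy file sketch could look like the following (the file is traditionally JSON, which is also valid YAML; the chosen predicates, priorities, and weights here are purely illustrative, drawn from the documented policy reference):

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    { "name": "PodFitsResources" },
    { "name": "PodToleratesNodeTaints" },
    { "name": "MatchNodeSelector" }
  ],
  "priorities": [
    { "name": "LeastRequestedPriority", "weight": 1 },
    { "name": "NodeAffinityPriority", "weight": 2 }
  ]
}
```

The weights let you bias the scoring phase – here, Node affinity preferences would count twice as much as resource-based spreading.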
Important note
In managed Kubernetes clusters, such as the managed Azure Kubernetes Service, you will not be able to change the scheduling policy of kube-scheduler, as you do not have access to the Kubernetes master Nodes.
There are two configuration fields that are the most important in a scheduling policy:
- predicates: the set of rules used in the filtering phase to determine which Nodes are capable of running the Pod
- priorities: the set of scoring functions (with weights) used in the scoring phase to rank the filtered Nodes
The full list of currently supported predicates and priorities is available in the official documentation: https://kubernetes.io/docs/reference/scheduling/policies/. We will give an overview of a few of the most interesting ones, which show how flexible the default kube-scheduler is. Some of the selected predicates are shown in the following list:
- PodFitsResources: checks whether the Node has enough free resources (such as CPU and memory) to satisfy the resource requests of the Pod
- MatchNodeSelector: checks whether the Pod's nodeSelector and Node affinity rules match the labels of the Node
- PodToleratesNodeTaints: checks whether the Pod's tolerations can tolerate the taints of the Node
- NoDiskConflict: checks whether the volumes requested by the Pod conflict with volumes already in use on the Node
Some of the interesting available priorities are as follows:
- SelectorSpreadPriority: spreads Pods that belong to the same Service, StatefulSet, or ReplicaSet across different Nodes
- LeastRequestedPriority: favors Nodes with the lowest ratio of requested resources, which spreads Pods onto less utilized Nodes
- NodeAffinityPriority: scores Nodes according to the preferredDuringSchedulingIgnoredDuringExecution Node affinity rules of the Pod
- TaintTolerationPriority: gives lower scores to Nodes with a higher number of PreferNoSchedule taints that the Pod does not tolerate
- ImageLocalityPriority: favors Nodes that already have the container images required by the Pod cached locally
The preceding examples are just a subset of the available predicates and priorities, but this already gives an overview of how many complex use cases and scenarios are supported out of the box in kube-scheduler.
This chapter has given an overview of advanced techniques for Pod scheduling in Kubernetes. First, we recapped the theory behind the kube-scheduler implementation and explained the process of scheduling Pods. Next, we introduced the concept of Node affinity in Pod scheduling. You learned about the basic scheduling methods that use Node names and Node selectors, and building on that, we explained how the more advanced Node affinity works. We also explained how you can use the affinity concept to achieve anti-affinity, and what inter-Pod affinity/anti-affinity is. After that, we discussed taints for Nodes and the tolerations specified by Pods. You learned about the different effects of taints and put this knowledge into practice in an advanced use case involving NoExecute and NoSchedule taints on a Node. Lastly, we discussed the theory behind the scheduling policies that can be used to configure the default kube-scheduler.
In the next chapter, we are going to discuss autoscaling of Pods and Nodes in Kubernetes – this will be a topic that will show how flexibly Kubernetes can run workloads in cloud environments.
For more information regarding Pod scheduling in Kubernetes, please refer to the following PacktPub books:
You can also refer to official documents: