Prioritizing pods in scheduling

Quality of service ensures that a pod has access to the appropriate resources, but it doesn't take the pod's importance into consideration. To be more precise, QoS only comes into play after a pod has been scheduled, not during scheduling. Therefore, we need an orthogonal feature to denote a pod's criticality or importance.

Before Kubernetes 1.11, making a pod's criticality visible to Kubernetes meant putting the pod in the kube-system namespace and annotating it with scheduler.alpha.kubernetes.io/critical-pod. This annotation is deprecated in newer versions of Kubernetes. See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/ for more information.
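For reference, the legacy marking looked roughly like the following. This is a sketch for illustration only; the pod name and image are made up, but the annotation key is the one named above (an empty string is its conventional value):

# Legacy (pre-1.11) critical pod marking -- illustrative sketch
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon        # hypothetical name
  namespace: kube-system         # the pod had to live in kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
spec:
  containers:
  - name: addon
    image: k8s.gcr.io/pause:3.1  # placeholder image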

The priority of a pod is defined by the priority class it belongs to. A priority class uses a 32-bit integer less than 1e9 (one billion) to represent the priority; a larger number means a higher priority. Numbers of one billion and above are reserved for system components. For instance, the priority class for critical components uses two billion:

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: system-cluster-critical
value: 2000000000
description: Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.

As a priority class is cluster-wide (it isn't namespaced), the optional description field helps cluster users decide whether they should use a class. If a pod is created without specifying a class, its priority is the value of the default priority class, or 0 if no default priority class exists in the cluster. A default priority class is defined by adding a globalDefault: true field to the specification of a priority class. Note that there can only be one default priority class in a cluster. On the pod side, the class is referenced in the .spec.priorityClassName field.
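Putting these pieces together, a default class and a pod referencing a class by name might look like this (a sketch; the names and values are made up for illustration):

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: default-prio        # hypothetical name
value: 1000
globalDefault: true         # at most one class in the cluster may set this
description: The cluster-wide default priority class.
---
apiVersion: v1
kind: Pod
metadata:
  name: some-pod            # hypothetical name
spec:
  priorityClassName: default-prio   # resolved to .spec.priority at admission
  containers:
  - name: main
    image: busybox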

The principle of the priority feature is simple: if there are pods waiting to be scheduled, Kubernetes picks the higher priority pods first, rather than following the order of the pods in the queue. But what if no node can accommodate a new pod? If pod preemption is enabled in the cluster (which it is by default from Kubernetes 1.11 onward), the preemption process is triggered to make room for higher priority pods. More specifically, the scheduler evaluates the affinity or the node selector of the pod to find eligible nodes. It then finds pods to evict on those eligible nodes according to their priority: if removing all pods with a priority lower than the pending pod's priority would let the pending pod fit on a node, then some of those lower priority pods will be preempted.

Considering a pod's priority together with its affinity to other pods can sometimes produce unexpected scheduling results. For example, say there are several running pods on a node and a pending pod called Pod-P. Assume the priority of Pod-P is higher than that of every pod on the node, so it could preempt every running pod on the target node. Pod-P also has a pod affinity that requires it to run together with certain pods on that node. Combining the priority and the affinity, we find that Pod-P won't be scheduled. This is because all pods with a lower priority are taken into consideration during preemption evaluation, even though Pod-P doesn't need all of them to be removed in order to fit. Since removing the pods associated with Pod-P's affinity would break that affinity, the node is deemed ineligible for Pod-P.

The preemption process doesn't take the QoS class into consideration. Even a pod in the guaranteed QoS class can be preempted by best-effort pods with higher priorities. We can see how preemption interacts with QoS classes by way of an experiment. Here, we'll use minikube for demonstration purposes because it has only one node, so we can be sure that the scheduler will try to run everything on the same node. If you want to run the same experiment on a cluster with multiple nodes, pod affinity can help confine the pods to one node.

First, we'll need some priority classes, which can be found in the chapter8/8-1_scheduling/prio-demo.yml file. Just apply the file as follows:

$ kubectl apply -f chapter8/8-1_scheduling/prio-demo.yml
priorityclass.scheduling.k8s.io/high-prio created
priorityclass.scheduling.k8s.io/low-prio created
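The file itself isn't reproduced here, but based on the priority values shown later in this demonstration (100000 and -1000), the two classes would look roughly like this sketch:

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-prio
value: 100000
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: low-prio
value: -1000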

After that, let's see how much memory our minikube node can provide:

$ kubectl describe node minikube | grep -A 6 Allocated
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests     Limits
  --------  --------     ------
  cpu       675m (33%)   20m (1%)
  memory    150Mi (7%)   200Mi (10%)

Our node has around 93% of its allocatable memory still available. We can arrange two pods with an 800 MB memory request each in the low priority class, and one higher priority pod with an 80 MB request and limit (plus certain CPU limits). The example templates for the two deployments can be found at chapter8/8-1_scheduling/{lowpods-gurantee-demo.yml,highpods-burstable-demo.yml}, respectively. Create the two deployments:
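As a rough sketch of what the low priority template contains (the exact file contents may differ, and the CPU values here are illustrative): it sets equal requests and limits so that the pods land in the guaranteed QoS class, and it references the low-prio class. The high priority template is analogous but uses 80Mi of memory and unequal CPU requests and limits, which places its pods in the burstable class.

# lowpods (sketch): requests == limits for all resources => guaranteed QoS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lowpods
spec:
  replicas: 2
  selector:
    matchLabels: {app: lowpods}
  template:
    metadata:
      labels: {app: lowpods}
    spec:
      priorityClassName: low-prio
      containers:
      - name: main
        image: busybox              # placeholder image
        command: ["sleep", "3600"]
        resources:
          requests: {memory: 800Mi, cpu: 25m}
          limits:   {memory: 800Mi, cpu: 25m}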

$ kubectl apply -f lowpods-gurantee-demo.yml
deployment.apps/lowpods created
$ kubectl apply -f highpods-burstable-demo.yml
deployment.apps/highpods created
$ kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE       NOMINATED NODE
highpods-77dd55b549-sdpbv   1/1     Running   0          6s    172.17.0.9   minikube   <none>
lowpods-65ff8966fc-xnv4v    1/1     Running   0          23s   172.17.0.7   minikube   <none>
lowpods-65ff8966fc-xswjp    1/1     Running   0          23s   172.17.0.8   minikube   <none>
$ kubectl describe node | grep -A 6 Allocated
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       775m (38%)    120m (6%)
  memory    1830Mi (96%)  1800Mi (95%)

$ kubectl get pod -o go-template --template='{{range .items}}{{printf "pod/%s: %s, priorityClass:%s(%.0f) " .metadata.name .status.qosClass .spec.priorityClassName .spec.priority}}{{end}}'
pod/highpods-77dd55b549-sdpbv: Burstable, priorityClass:high-prio(100000)
pod/lowpods-65ff8966fc-xnv4v: Guaranteed, priorityClass:low-prio(-1000)
pod/lowpods-65ff8966fc-xswjp: Guaranteed, priorityClass:low-prio(-1000)

We can see that the three pods are running on the same node, and that the node is close to running out of capacity. The two lower priority pods are in the guaranteed QoS class, while the higher priority one is in the burstable class. Now, we just need to add one more high priority pod:

$ kubectl scale deployment --replicas=2 highpods
deployment.extensions/highpods scaled
$ kubectl get pod -o wide
NAME                        READY   STATUS        RESTARTS   AGE   IP           NODE       NOMINATED NODE
highpods-77dd55b549-g2m6t   0/1     Pending       0          3s    <none>       <none>     minikube
highpods-77dd55b549-sdpbv   1/1     Running       0          20s   172.17.0.9   minikube   <none>
lowpods-65ff8966fc-rsx7j    0/1     Pending       0          3s    <none>       <none>     <none>
lowpods-65ff8966fc-xnv4v    1/1     Terminating   0          37s   172.17.0.7   minikube   <none>
lowpods-65ff8966fc-xswjp    1/1     Running       0          37s   172.17.0.8   minikube   <none>
$ kubectl describe pod lowpods-65ff8966fc-xnv4v
...
Events:
...
Normal  Started    41s  kubelet, minikube  Started container
Normal  Preempted  16s  default-scheduler  by default/highpods-77dd55b549-g2m6t on node minikube

As soon as we add a higher priority pod, one of the lower priority pods is killed. From the event messages, we can clearly see that the pod was terminated because it was preempted, even though it's in the guaranteed QoS class. One thing to note is that the replacement low priority pod, lowpods-65ff8966fc-rsx7j, was created by its Deployment rather than restarted through the pod's restartPolicy.
