Needless to say, having autoscaling capabilities for your cloud-native application is considered the holy grail of running applications in the cloud. In short, autoscaling is a method of automatically and dynamically adjusting the amount of computational resources, such as CPU and RAM, available to your application. The goal is to cleverly add or remove resources based on end-user activity and demand. For example, an application may require more CPU and RAM during daytime hours, when users are most active, but much less during the night. Similarly, if you run an e-commerce business, you can expect a huge spike in demand during Black Friday. In this way, you not only provide a better, highly available service to users but also reduce the cost of goods sold (COGS) for the business. The fewer resources you consume in the cloud, the less you pay, and the business can invest the money elsewhere – a win-win situation. There is, of course, no single rule that fits all use cases, so good autoscaling needs to be based on critical usage metrics and should have predictive features to anticipate workloads based on history.
Kubernetes, as the most mature container orchestration system available, comes with a variety of built-in autoscaling features. Some of these are natively supported in every Kubernetes cluster, while others require installation or a specific type of cluster deployment. There are also multiple dimensions along which you can scale:
In this chapter, we will cover the following topics:
For this chapter, you will need the following:
Basic Kubernetes cluster deployment (local and cloud-based) and kubectl installation have been covered in Chapter 3, Installing Your First Kubernetes Cluster.
The following chapters can provide you with an overview of how to deploy a fully functional Kubernetes cluster on different cloud platforms and install the requisite CLIs to manage them:
You can download the latest code samples for this chapter from the official GitHub repository at https://github.com/PacktPublishing/The-Kubernetes-Bible/tree/master/Chapter20.
Before we dive into the topics of autoscaling in Kubernetes, we need to explain a bit more about how you can control the CPU and memory resources (known as compute resources) used by Pod containers in Kubernetes. Controlling the use of compute resources is important since, in this way, you can enforce resource governance – this allows better planning of cluster capacity and, most importantly, prevents situations where a single container consumes all compute resources and prevents other Pods from serving requests.
When you create a Pod, it is possible to specify the compute resources its containers require and the limits on their permitted consumption. The Kubernetes resource model provides an additional distinction between two classes of resources: compressible and incompressible. In short, a compressible resource can be easily throttled, without severe consequences. A perfect example of such a resource is the CPU – if you need to throttle CPU usage for a given container, the container will operate normally, just slower. On the other hand, we have incompressible resources that cannot be throttled without severe consequences – RAM allocation is an example of such a resource. If you do not allow a process running in a container to allocate more memory, the process will crash and result in a container restart.
Important note
If you want to know more about the philosophy and design decisions for the Kubernetes resource governance model, we recommend reading the official design proposal documents. Resource model: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/resources.md. Resource quality of service: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md.
To control the resources for a Pod container, you can specify two values in its specification:
If you use different values for requests and limits, you can allow for resource overcommit. This technique is useful for efficiently handling short bursts of resource usage while allowing better resource usage on average. The reasoning behind this is that you will rarely have all containers on the Node requiring maximum resources, as they specify in limits, at the same time. This gives you better bin packing of your Pods for the majority of the time. The concept is similar to overprovisioning for virtual machine hypervisors or, in the real world, overbooking for airplane flights.
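To make the overcommit idea concrete, here is a small shell sketch with hypothetical numbers (a Node with 2000m KCU allocatable and ten identical Pods; none of these values come from the chapter's examples):

```shell
#!/bin/sh
# Hypothetical: a Node with 2000m KCU allocatable, running 10 Pods that
# each request 150m but are limited to 400m.
awk 'BEGIN {
  allocatable = 2000
  requests = 10 * 150   # 1500m: what the scheduler guarantees - this fits
  limits   = 10 * 400   # 4000m: what the Pods may burst to - overcommitted
  printf "requests: %dm (%d%% of allocatable)\n", requests, requests * 100 / allocatable
  printf "limits:   %dm (%d%% of allocatable)\n", limits, limits * 100 / allocatable
}'
```

The sum of requests (75% of the Node) is what scheduling is based on, while the sum of limits (200% of the Node) is allowed to exceed capacity – exactly the overcommit described above.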
If you do not specify limits at all, the container can consume as much of a Node's resources as it wants. This can be controlled with namespace resource quotas and limit ranges – you can read more about these objects in the official documentation: https://kubernetes.io/docs/concepts/policy/limit-range/.
Tip
In more advanced scenarios, you can also control huge pages and ephemeral storage requests and limits.
Before we dive into the configuration details, we need to look at the units used for measuring CPU and memory in Kubernetes. For CPU, the base unit is a Kubernetes CPU (KCU), where 1 is equivalent to, for example, 1 vCPU on Azure, 1 core on GCP, or 1 hyperthreaded core on a bare-metal machine. Fractional values are allowed: 0.1 can also be specified as 100m (milliKCUs). For memory, the base unit is the byte; you can, of course, use standard unit prefixes, such as M, Mi, G, or Gi.
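The following plain-shell sketch illustrates these unit conversions (no cluster required; note that binary prefixes such as Mi are powers of 1024, while decimal prefixes such as M are powers of 1000):

```shell
#!/bin/sh
# CPU: 1 KCU = 1000m (milliKCUs), so 0.1 KCU can be written as 100m
awk 'BEGIN { printf "0.1 KCU = %dm\n", 0.1 * 1000 }'
# Memory: Mi is a power of 1024, M is a power of 1000,
# so 50Mi is roughly 5% more memory than 50M
awk 'BEGIN { printf "50Mi = %d bytes\n", 50 * 1024 * 1024 }'
awk 'BEGIN { printf "50M  = %d bytes\n", 50 * 1000 * 1000 }'
```

Mixing up M and Mi in a manifest is a common source of subtly undersized containers, so it is worth being deliberate about which prefix you use.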
To enable compute resource requests and limits for Pod containers in our nginx Deployment that we used in the previous chapters, you can make the following changes to the YAML manifest, nginx-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-example
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
      environment: test
  template:
    metadata:
      labels:
        app: nginx
        environment: test
    spec:
      containers:
        - name: nginx
          image: nginx:1.17
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 200m
              memory: 60Mi
            requests:
              cpu: 100m
              memory: 50Mi
For each container that you have in the Pod, you can specify the .spec.template.spec.containers[*].resources field. In this case, we have set limits at 200m KCU and 60Mi for RAM, and requests at 100m KCU and 50Mi for RAM.
When you apply the manifest to the cluster using kubectl apply -f ./nginx-deployment.yaml, you can describe one of the Nodes in the cluster that run Pods for this Deployment and you will see detailed information about compute resources quotas and allocation:
$ kubectl describe node aks-nodepool1-77120516-vmss000000
...
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default nginx-deployment-example-5d8b9979d4-9sd9x 100m (5%) 200m (10%) 50Mi (1%) 60Mi (1%) 8m12s
default nginx-deployment-example-5d8b9979d4-rbwv2 100m (5%) 200m (10%) 50Mi (1%) 60Mi (1%) 8m10s
default nginx-deployment-example-5d8b9979d4-sfzx9 100m (5%) 200m (10%) 50Mi (1%) 60Mi (1%) 8m10s
kube-system kube-proxy-q6xdq 100m (5%) 0 (0%) 0 (0%) 0 (0%) 10d
kube-system omsagent-czm6q 75m (3%) 500m (26%) 225Mi (4%) 600Mi (13%) 17d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 475m (25%) 1100m (57%)
memory 375Mi (8%) 780Mi (17%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-azure-disk 0 0
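As a quick sanity check, the Allocated resources totals are simply the sums of the per-Pod rows in the table above:

```shell
#!/bin/sh
# CPU requests: three nginx Pods at 100m each, kube-proxy at 100m, omsagent at 75m
awk 'BEGIN { printf "cpu requests: %dm\n", 3 * 100 + 100 + 75 }'      # 475m
# CPU limits: three nginx Pods at 200m each, omsagent at 500m
awk 'BEGIN { printf "cpu limits: %dm\n", 3 * 200 + 500 }'             # 1100m
# Memory requests: three nginx Pods at 50Mi each, omsagent at 225Mi
awk 'BEGIN { printf "memory requests: %dMi\n", 3 * 50 + 225 }'        # 375Mi
```

These match the 475m, 1100m, and 375Mi figures reported by kubectl describe node.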
Now, based on this information, you could experiment and set the CPU requests for the container to a value higher than the capacity of a single Node in the cluster – in our case, 2000m KCU. When you apply this change to the Deployment, you will notice that new Pods hang in the Pending state because they cannot be scheduled on any matching Node. In such cases, inspecting the Pod will reveal the following:
$ kubectl describe pod nginx-deployment-example-56868549b-5n6lj
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 25s default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
There were no Nodes that could accommodate a Pod that has a container requiring 2000m KCU, and therefore the Pod cannot be scheduled at this moment.
With knowledge of how to manage compute resources, we will move on to autoscaling topics: first, we are going to explain the vertical autoscaling of Pods.
In the previous section, we managed the requests and limits for compute resources manually. Setting these values correctly requires accurate estimation, metric observation, and benchmarking to adjust. Setting requests too high results in wasted compute resources, whereas setting them too low may result in Pods being packed too densely and suffering performance issues. Also, in some cases, the only way to scale a Pod workload is vertically, by increasing the amount of compute resources it can consume. For bare-metal machines, this would mean upgrading the CPU hardware and adding more physical RAM. For containers, it is as simple as granting them larger compute resource quotas. This works, of course, only up to the capacity of a single Node – you cannot scale vertically beyond that unless you add more powerful Nodes to the cluster.
To help resolve these issues, Kubernetes offers a Vertical Pod Autoscaler (VPA), which can increase and decrease CPU and memory resource requests for Pod containers dynamically. The goal is to better match the actual usage rather than rely on hardcoded, predefined values. Controlling limits within specified ratios is also supported.
The VPA is implemented using a Custom Resource Definition (CRD) object named VerticalPodAutoscaler. This means that the object is not part of the standard Kubernetes API groups and needs to be installed in the cluster. The VPA is developed as part of the autoscaler project (https://github.com/kubernetes/autoscaler) in the Kubernetes ecosystem.
There are three main components of a VPA:
The reason why the updater needs to terminate Pods, and why the VPA has to rely on the admission plugin, is that Kubernetes does not support dynamic changes to resource requests and limits. The only way is to terminate the Pod and create a new one with the new values. In-place updates of these values are tracked in KEP-1287 (https://github.com/kubernetes/enhancements/pull/1883) and, when implemented, will make the design of the VPA much simpler and improve availability.
Important note
A VPA can run in recommendation-only mode, where you see the suggested values in the VPA object but the changes are not applied to the Pods. A VPA is currently considered experimental, and using it in a mode that recreates Pods may lead to downtime for your application. This should change when in-place updates of Pod requests and limits are implemented.
Some Kubernetes offerings come with one-click support for installing a VPA. Two good examples are OpenShift and GKE. We will now quickly explain how you can do that if you are running a GKE cluster.
Assuming that your GKE cluster is named k8sforbeginners, as in Chapter 14, Kubernetes Clusters on Google Kubernetes Engine, enabling a VPA is as simple as running the following command:
$ gcloud container clusters update k8sforbeginners --enable-vertical-pod-autoscaling
Note that this operation causes a restart of the Kubernetes control plane.
If you want to enable a VPA for a new cluster, you can use the additional argument --enable-vertical-pod-autoscaling, for example:
$ gcloud container clusters create k8sforbeginners --num-nodes=2 --zone=us-central1-a --enable-vertical-pod-autoscaling
The GKE cluster will have a VPA CRD available, and you can use it to control the vertical autoscaling of Pods.
In the case of different platforms such as AKS or EKS (or even local deployments for testing), you need to install a VPA manually by adding a VPA CRD to the cluster. The exact, most recent steps are documented in the corresponding GitHub repository: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#installation.
To install a VPA in your cluster, please perform the following steps:
$ git clone https://github.com/kubernetes/autoscaler
$ cd autoscaler/vertical-pod-autoscaler
$ ./hack/vpa-up.sh
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
vpa-admission-controller-688857d5c4-4l9c2 1/1 Running 0 10s
vpa-recommender-74849cc845-qbfpg 1/1 Running 0 11s
vpa-updater-6dbd6569d6-9np22 1/1 Running 0 12s
The VPA components are running, and we can now proceed to testing a VPA on real Pods.
For demonstration purposes, we need a Deployment with Pods that cause actual consumption of CPU. The Kubernetes autoscaler repository has a good, simple example that has predictable CPU usage: https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/examples/hamster.yaml. We are going to modify this example a bit and do a step-by-step demonstration. Let's prepare the Deployment first:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hamster
spec:
  selector:
    matchLabels:
      app: hamster
  replicas: 5
  template:
    metadata:
      labels:
        app: hamster
    spec:
      containers:
        - name: hamster
          image: ubuntu:20.04
          resources:
            requests:
              cpu: 100m
              memory: 50Mi
          command:
            - /bin/sh
            - -c
            - while true; do timeout 0.5s yes >/dev/null; sleep 0.5s; done
It's a real hamster! The command used in the Pod's ubuntu container consumes the maximum available CPU for 0.5 seconds and then does nothing for 0.5 seconds, continuously. This means that actual CPU usage will stay, on average, at around 500m KCU. However, the resource requests specify only 100m KCU. The Pod will therefore consume more than it declares but, since no limits are set, Kubernetes will not throttle the container's CPU. This could potentially lead to incorrect scheduling decisions by the Kubernetes Scheduler.
$ kubectl apply -f ./hamster-deployment.yaml
deployment.apps/hamster created
$ kubectl top pod
NAME CPU(cores) MEMORY(bytes)
hamster-779cfd69b4-5bnbf 475m 1Mi
hamster-779cfd69b4-8dt5h 497m 1Mi
hamster-779cfd69b4-mn5p5 492m 1Mi
hamster-779cfd69b4-n7nss 496m 1Mi
hamster-779cfd69b4-rl29j 484m 1Mi
As we expected, the CPU consumption for each Pod in the deployment oscillates at around 500m KCU.
With that, we can move on to creating a VPA for our Pods. VPAs can operate in four modes that you specify by means of the .spec.updatePolicy.updateMode field:
We will first create a VPA for the hamster Deployment running in Off mode, and later we will enable Auto mode. To do this, please perform the following steps:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hamster-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hamster
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 50Mi
        maxAllowed:
          cpu: 1
          memory: 500Mi
        controlledResources:
          - cpu
          - memory
This VPA is created for the Deployment object named hamster, as specified in .spec.targetRef. The mode is set to "Off" in .spec.updatePolicy.updateMode ("Off" needs to be quoted to avoid being interpreted as a Boolean), and the container resource policy is configured in .spec.resourcePolicy.containerPolicies. The policy that we used allows Pod container CPU requests to be adjusted automatically between 100m KCU and 1000m KCU, and memory requests between 50Mi and 500Mi.
$ kubectl apply -f ./hamster-vpa.yaml
verticalpodautoscaler.autoscaling.k8s.io/hamster-vpa created
$ kubectl describe vpa hamster-vpa
...
Status:
  Conditions:
    Last Transition Time:  2021-03-28T14:33:33Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  hamster
      Lower Bound:
        Cpu:     551m
        Memory:  262144k
      Target:
        Cpu:     587m
        Memory:  262144k
      Uncapped Target:
        Cpu:     587m
        Memory:  262144k
      Upper Bound:
        Cpu:     1
        Memory:  378142066
The VPA has recommended a target of 587m KCU – a bit more than the roughly 500m we expected – along with 262144k of memory. This makes sense, as the Pod should have a safety buffer for CPU consumption.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hamster-vpa
spec:
  ...
  updatePolicy:
    updateMode: Auto
  ...
$ kubectl apply -f ./hamster-vpa.yaml
verticalpodautoscaler.autoscaling.k8s.io/hamster-vpa configured
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hamster-779cfd69b4-5bnbf 1/1 Running 0 45m
hamster-779cfd69b4-8dt5h 1/1 Terminating 0 45m
hamster-779cfd69b4-9tqfx 1/1 Running 0 60s
hamster-779cfd69b4-n7nss 1/1 Running 0 45m
hamster-779cfd69b4-wdz8t 1/1 Running 0 60s
$ kubectl describe pod hamster-779cfd69b4-9tqfx
...
Annotations: vpaObservedContainers: hamster
vpaUpdates: Pod resources updated by hamster-vpa: container 0: cpu request, memory request
...
Containers:
hamster:
...
Requests:
cpu: 587m
memory: 262144k
...
As you can see, the newly started Pod has CPU and memory requests set to the values recommended by the VPA!
Important note
A VPA should not be used with an HPA running on CPU/memory metrics at this moment. However, you can use a VPA in conjunction with an HPA running on custom metrics.
Next, we are going to discuss how you can horizontally autoscale Pods using a Horizontal Pod Autoscaler (HPA).
While a VPA acts as an optimizer of resource usage, the true scaling of Deployments and StatefulSets that run multiple Pod replicas is done using a Horizontal Pod Autoscaler (HPA). At a high level, the goal of the HPA is to automatically scale the number of replicas in a Deployment or StatefulSet depending on the current CPU utilization or other custom metrics (including multiple metrics at once). The details of the algorithm that determines the target number of replicas based on metric values can be found here: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details. HPAs are highly configurable and, in this chapter, we will cover a standard scenario in which we would like to autoscale based on target CPU usage.
Important note
An HPA is represented by a built-in HorizontalPodAutoscaler API resource in Kubernetes in the autoscaling API group. The current stable version that supports CPU autoscaling only can be found in the autoscaling/v1 API version. The beta version that supports autoscaling based on RAM and custom metrics can be found in the autoscaling/v2beta2 API version.
The role of the HPA is to monitor a configured metric for the Pods, for example, CPU usage, and determine whether a change to the number of replicas is needed. Usually, the HPA calculates the average of the current metric value across all Pods and determines whether adding or removing replicas would bring that average closer to the specified target value. For example, say you set the target CPU usage to 50%. At some point, increased demand for the application causes the Deployment's Pods to reach 80% CPU usage. The HPA will decide to add more Pod replicas so that the average usage across all replicas falls closer to 50%, and the cycle repeats. In other words, the HPA tries to keep the average CPU usage as close to 50% as possible. This is a continuous, closed-loop controller – in real life, a thermostat reacting to temperature changes in a building is a good analogy. Additionally, the HPA uses mechanisms such as a stabilization window to prevent replicas from scaling down too quickly and causing unwanted replica flapping.
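As a rough sketch, the core scaling formula can be expressed in plain shell; note that this is illustrative only, since the real controller also applies a tolerance, readiness handling, and the stabilization window mentioned above:

```shell
#!/bin/sh
# Core HPA formula (simplified):
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
desired_replicas() {
  awk -v r="$1" -v c="$2" -v t="$3" \
    'BEGIN { d = r * c / t; printf "%d\n", (d == int(d)) ? d : int(d) + 1 }'
}
# 4 replicas averaging 80% CPU against a 50% target: scale up to ceil(6.4) = 7
desired_replicas 4 80 50
# 4 replicas averaging 25% CPU against a 50% target: scale down to 2
desired_replicas 4 25 50
```

Because the current metric appears in the numerator, the formula naturally converges: adding replicas lowers the average metric on the next evaluation, which lowers the next desired count.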
Tip
GKE has added support for multidimensional Pod autoscaling that combines horizontal scaling using CPU metrics and vertical scaling based on memory usage at the same time. You can read more about this feature in the official documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/multidimensional-pod-autoscaling.
As an HPA is a built-in feature of Kubernetes, there is no need to perform any installation. We just need to prepare a Deployment for testing and create a HorizontalPodAutoscaler API object.
To test an HPA, we are going to rely on the standard CPU usage metric. This means that we need to configure CPU requests on the Deployment Pods; otherwise, autoscaling is not possible, as there is no absolute base value against which to calculate the percentage metric. On top of that, we again need a Deployment that consumes a predictable amount of CPU resources. Of course, in real use cases, the varying CPU usage would come from actual end-user demand for your application.
Unfortunately, there is no simple out-of-the-box way to have predictable, varying CPU usage in a container, so we have to prepare a Deployment with a Pod template that does this for us. We will modify our hamster Deployment approach and create an elastic-hamster Deployment, in which the small shell script running continuously in the container behaves slightly differently. We assign a total desired amount of work to all hamsters in all Pods together. Each Pod queries the Kubernetes API to check how many replicas of the Deployment are currently running, and we divide the total desired work by that number to get the amount of work a single hamster needs to do. For example, say all hamsters together should do 1.0 of work, which roughly maps to a total consumption of 1 KCU in the cluster. If you deploy five replicas, each hamster does 1.0/5 = 0.2 of the work, so it works for 0.2 seconds and sleeps for 0.8 seconds. If we scale the Deployment manually to 10 replicas, the work per hamster falls to 0.1 seconds, and each sleeps for 0.9 seconds. As you can see, collectively they always work for 1.0 second, no matter how many replicas we use. This reflects a real-life scenario in which end users generate a certain amount of traffic that you distribute among the Pod replicas: the more Pod replicas you have, the less traffic each has to handle and, in the end, the lower the average CPU usage metric will be.
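The work-division arithmetic above can be sketched quickly in plain shell, with no cluster required:

```shell
#!/bin/sh
# Total desired work is fixed at 1.0; each hamster works TOTAL/replicas
# seconds of every cycle and sleeps for the remainder
TOTAL=1.0
for replicas in 5 10; do
  awk -v t="$TOTAL" -v r="$replicas" \
    'BEGIN { printf "replicas=%d work=%.1fs sleep=%.1fs\n", r, t / r, 1 - t / r }'
done
```

With 5 replicas, each hamster works for 0.2s and sleeps 0.8s; with 10 replicas, 0.1s and 0.9s, so aggregate CPU consumption stays constant while per-Pod consumption drops.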
Querying Deployments via the Kubernetes API will require some additional RBAC setup. You can find more details in Chapter 18, Authentication and Authorization on Kubernetes. To create the deployment for the demonstration, please perform the following steps:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-hamster
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: deployment-reader
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-deployments
  namespace: default
subjects:
  - kind: ServiceAccount
    name: elastic-hamster
    namespace: default
roleRef:
  kind: Role
  name: deployment-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elastic-hamster
spec:
  selector:
    matchLabels:
      app: elastic-hamster
  replicas: 5
  template:
    metadata:
      labels:
        app: elastic-hamster
    spec:
      serviceAccountName: elastic-hamster
      containers:
        - name: hamster
          image: ubuntu:20.04
          resources:
            requests:
              cpu: 200m
              memory: 50Mi
          env:
            - name: TOTAL_HAMSTER_USAGE
              value: "1.0"
          command:
            - /bin/sh
            - -c
            - |
              ... shell command available in the next step ...
While it is not good practice to keep long shell scripts in YAML manifest definitions, for demonstration purposes it is easier than creating a dedicated container image, pushing it to an image registry, and consuming it. Let's take a look at what is happening in the manifest file. Initially, we have five replicas. Each Pod container has requests with cpu set to 200m KCU and memory set to 50Mi. We also define an environment variable, TOTAL_HAMSTER_USAGE, with an initial value of "1.0", for readability. This variable defines the total collective work that the hamsters are expected to do.
# Install curl and jq
apt-get update && apt-get install -y curl jq || exit 1
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat ${SERVICEACCOUNT}/token)
while true; do
  # Calculate CPU usage per hamster. This dynamically adjusts to
  # TOTAL_HAMSTER_USAGE / num_replicas, so for the initial 5 replicas it is 0.2
  HAMSTER_USAGE=$(curl -s --cacert ${SERVICEACCOUNT}/ca.crt --header "Authorization: Bearer ${TOKEN}" https://kubernetes/apis/apps/v1/namespaces/default/deployments/elastic-hamster | jq "${TOTAL_HAMSTER_USAGE} / .spec.replicas")
  # The hamster sleeps for the rest of the time, with a small adjustment factor
  HAMSTER_SLEEP=$(jq -n "1.2 - ${HAMSTER_USAGE}")
  echo "Hamster uses ${HAMSTER_USAGE} and sleeps ${HAMSTER_SLEEP}"
  timeout ${HAMSTER_USAGE}s yes >/dev/null
  sleep ${HAMSTER_SLEEP}s
done
As its very first step, the shell script installs the curl and jq packages from the APT repository. We define the SERVICEACCOUNT and TOKEN variables, which we need to query the Kubernetes API. Then, we retrieve the elastic-hamster Deployment from the API using https://kubernetes/apis/apps/v1/namespaces/default/deployments/elastic-hamster. The result is parsed using the jq command: we extract the .spec.replicas field and use it to divide the total work between all hamsters. Based on this number, we make the hamster work for the calculated period of time and then sleep for the rest. As you can see, if the number of replicas in the Deployment changes, whether by manual action or by autoscaling, the amount of work done by an individual hamster changes too – and therefore, the more Pod replicas we have, the lower the CPU usage per Pod.
$ kubectl apply -f ./
role.rbac.authorization.k8s.io/deployment-reader created
deployment.apps/elastic-hamster created
serviceaccount/elastic-hamster created
rolebinding.rbac.authorization.k8s.io/read-deployments created
$ kubectl logs elastic-hamster-5897858459-26bdd
...
Running hooks in /etc/ca-certificates/update.d...
done.
Hamster uses 0.2 and sleeps 1
Hamster uses 0.2 and sleeps 1
...
$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
elastic-hamster-5897858459-26bdd 229m 40Mi
elastic-hamster-5897858459-f2856 210m 40Mi
elastic-hamster-5897858459-lmphl 236m 40Mi
elastic-hamster-5897858459-m6j58 225m 40Mi
elastic-hamster-5897858459-qfh76 227m 41Mi
$ kubectl scale deploy elastic-hamster --replicas=2
deployment.apps/elastic-hamster scaled
$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
elastic-hamster-5897858459-m6j58 462m 40Mi
elastic-hamster-5897858459-qfh76 474m 40Mi
With the Deployment ready, we can start using the HPA to automatically adjust the number of replicas, which will target 75% of average CPU utilization across individual Pods. To do that, perform the following steps:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: elastic-hamster-hpa
spec:
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 75
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: elastic-hamster
The HPA targets the elastic-hamster Deployment, which we specified using .spec.scaleTargetRef. The configuration ensures that the HPA always keeps the number of replicas between minReplicas: 1 and maxReplicas: 10. The most important parameter for an HPA targeting the CPU metric is targetCPUUtilizationPercentage, which we set to 75%. This means the HPA will target 75% of the container's cpu requests value, which we set to 200m KCU; as a result, the HPA will try to keep CPU consumption at around 150m KCU per Pod. Our current Deployment, with only two replicas, is consuming much more – around 500m KCU per Pod on average.
$ kubectl apply -f ./elastic-hamster-hpa.yaml
horizontalpodautoscaler.autoscaling/elastic-hamster-hpa created
$ kubectl describe hpa elastic-hamster-hpa
...
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): 79% (159m) / 75%
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 15m horizontal-pod-autoscaler New size: 4; reason: cpu resource utilization (percentage of request) above target
Normal SuccessfulRescale 14m horizontal-pod-autoscaler New size: 6; reason: cpu resource utilization (percentage of request) above target
Normal SuccessfulRescale 13m horizontal-pod-autoscaler New size: 8; reason: cpu resource utilization (percentage of request) above target
Normal SuccessfulRescale 11m horizontal-pod-autoscaler New size: 9; reason: cpu resource utilization (percentage of request) above target
In the output, you can see that the Deployment was gradually scaled up over time until it eventually stabilized at 9 replicas. Note that your numbers may vary slightly. If you hit the maximum number of allowed replicas (10), you can try increasing it or adjusting the targetCPUUtilizationPercentage parameter.
Tip
It is possible to use an imperative command to achieve a similar result: kubectl autoscale deploy elastic-hamster --cpu-percent=75 --min=1 --max=10.
Congratulations! You have successfully configured horizontal autoscaling for your Deployment using an HPA. In the next section, we will take a look at autoscaling Kubernetes Nodes using a Cluster Autoscaler (CA), which gives even more flexibility when combined with an HPA.
So far, we have discussed scaling at the level of individual Pods, but this is not the only way in which you can scale your workloads on Kubernetes. It is possible to scale the cluster itself to accommodate changes in demand for compute resources – at some point, we will need more Nodes to run more Pods. This is solved by the CA, which is part of the Kubernetes autoscaler repository (https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler). The CA must be able to provision and deprovision Nodes for the Kubernetes cluster, so this means that vendor-specific plugins must be implemented. You can find the list of supported cloud service providers here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#deployment.
The CA periodically checks the status of Pods and Nodes and decides whether it needs to take action:
Important note
Pod containers must specify requests for compute resources for the CA to work properly. Additionally, these values should reflect real usage; otherwise, the CA will not be able to make correct decisions for your type of workload.
As you can see, the CA can complement HPA capabilities. If the HPA decides that there should be more Pods for a Deployment or StatefulSet, but no more Pods can be scheduled, then the CA can intervene and increase the cluster size.
Enabling the CA entails different steps depending on your cloud service provider. Additionally, some configuration values are specific for each of them. We will first take a look at GKE.
For GKE, it is easiest to create a cluster with CA enabled from scratch. To do that, you need to run the following command to create a cluster named k8sforbeginners:
$ gcloud container clusters create k8sforbeginners --num-nodes=2 --zone=us-central1-a --enable-autoscaling --min-nodes=2 --max-nodes=10
You can control the minimum number of Nodes in autoscaling by using the --min-nodes parameter, and the maximum number of Nodes by using the --max-nodes parameter.
In the case of an existing cluster, you need to enable the CA on an existing Node pool. For example, if you have a cluster named k8sforbeginners with one Node pool named nodepool1, then you need to run the following command:
$ gcloud container clusters update k8sforbeginners --enable-autoscaling --min-nodes=2 --max-nodes=10 --zone=us-central1-a --node-pool=nodepool1
The update will take a few minutes.
You can learn more in the official documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler.
Once configured, you can move on to Using the cluster autoscaler.
Setting up the CA in Amazon EKS cannot currently be done in a one-click or one-command action. You need to create an appropriate IAM policy and role, deploy the CA resources to the Kubernetes cluster, and perform manual configuration steps. For this reason, we will not cover it in this book; please refer to the official instructions: https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html.
Once configured, you can move on to Using the cluster autoscaler.
AKS provides a similar CA setup experience to GKE – you can use a one-command procedure to either deploy a new cluster with CA enabled or update the existing one to use the CA. To create a new cluster named k8sforbeginners-aks from scratch in the k8sforbeginners-rg resource group, execute the following command:
$ az aks create --resource-group k8sforbeginners-rg --name k8sforbeginners-aks --node-count 2 --enable-cluster-autoscaler --min-count 2 --max-count 10
You can control the minimum number of Nodes in autoscaling by using the --min-count parameter, and the maximum number of Nodes by using the --max-count parameter.
To enable the CA on an existing AKS cluster named k8sforbeginners-aks, execute the following command:
$ az aks update --resource-group k8sforbeginners-rg --name k8sforbeginners-aks --enable-cluster-autoscaler --min-count 2 --max-count 10
The update will take a few minutes.
You can learn more in the official documentation: https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler. Additionally, the CA in AKS has more parameters that you can configure using the autoscaler profile. Further details are provided in the official documentation at https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler#using-the-autoscaler-profile.
Now, let's take a look at how you can use the CA.
We have just configured the CA for the cluster, and it may take a bit of time until the CA performs its first actions. This depends on the CA configuration, which may be vendor-specific. For example, in the case of AKS, the cluster is evaluated every 10 seconds (scan-interval) to determine whether it needs to be scaled up or down. If a scale-down is to happen after a scale-up, there is a 10-minute delay (scale-down-delay-after-add). A Node becomes a scale-down candidate if the sum of the resources requested on it divided by its capacity is below 0.5 (scale-down-utilization-threshold).
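To make the scale-down-utilization-threshold criterion concrete, here is a minimal arithmetic sketch. The 2000m allocatable value is a hypothetical Node size, not something taken from the cluster above:

```shell
# A Node becomes a scale-down candidate when requested/allocatable < 0.5.
requested_mcpu=800     # sum of CPU requests of Pods on the Node (millicores)
allocatable_mcpu=2000  # Node's allocatable CPU (hypothetical 2-vCPU Node)
utilization=$(awk "BEGIN { printf \"%.2f\", $requested_mcpu / $allocatable_mcpu }")
echo "utilization: $utilization"   # 0.40 -> below the 0.50 threshold
```

Note that this is a simplification: before actually removing such a Node, the CA also verifies that all of its Pods can be rescheduled elsewhere.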
As a result, the cluster may automatically scale up, scale down, or remain unchanged after the CA has been enabled. If you are using exactly the same cluster setup as in our examples, you will be in the following situation:
This means that the cluster with the current workload will either scale down by one Node or remain unchanged.
Instead, we can make some modifications to our elastic-hamster Deployment to trigger a firmer response from the CA. We will increase the total amount of work requested from the elastic-hamster Deployment, increase the CPU requests of its Pods, and allow the HPA to create more replicas. This will quickly exceed the cluster capacity of 6000m CPU and cause the CA to scale the cluster up. To see this in action, perform the following steps:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elastic-hamster
spec:
  ...
  replicas: 7
  template:
    ...
    spec:
      serviceAccountName: elastic-hamster
      containers:
        - name: hamster
          image: ubuntu:20.04
          resources:
            requests:
              cpu: 500m
              memory: 50Mi
          env:
            - name: TOTAL_HAMSTER_USAGE
              value: "7.0"
...
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: elastic-hamster-hpa
spec:
  minReplicas: 1
  maxReplicas: 25
  ...
$ kubectl apply -f ./
role.rbac.authorization.k8s.io/deployment-reader unchanged
deployment.apps/elastic-hamster configured
horizontalpodautoscaler.autoscaling/elastic-hamster-hpa configured
serviceaccount/elastic-hamster unchanged
rolebinding.rbac.authorization.k8s.io/read-deployments unchanged
$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
...
elastic-hamster-5854d5f967-cjsmg   0/1     Pending   0          23s
elastic-hamster-5854d5f967-nsnqd   0/1     Pending   0          23s
...
$ kubectl get node
NAME                                STATUS     ROLES    AGE     VERSION
aks-nodepool1-77120516-vmss000000   Ready      agent    22d     v1.18.14
aks-nodepool1-77120516-vmss000001   Ready      agent    22d     v1.18.14
aks-nodepool1-77120516-vmss000002   Ready      agent    29h     v1.18.14
aks-nodepool1-77120516-vmss000003   Ready      agent    2m47s   v1.18.14
aks-nodepool1-77120516-vmss000004   NotReady   <none>   5s      v1.18.14
$ kubectl describe pod elastic-hamster-5854d5f967-grjbj
...
Events:
  Type     Reason            Age    From                Message
  ----     ------            ----   ----                -------
  Warning  FailedScheduling  5m28s  default-scheduler   0/7 nodes are available: 7 Insufficient cpu.
  Warning  FailedScheduling  3m6s   default-scheduler   0/8 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 7 Insufficient cpu.
  Normal   Scheduled         2m55s  default-scheduler   Successfully assigned default/elastic-hamster-5854d5f967-grjbj to aks-nodepool1-77120516-vmss000007
  Normal   TriggeredScaleUp  4m55s  cluster-autoscaler  pod triggered scale-up: [{aks-nodepool1-77120516-vmss 7->8 (max: 10)}]
$ kubectl describe hpa elastic-hamster-hpa
...
Metrics:                                               ( current / target )
  resource cpu on pods (as a percentage of request):   82% (410m) / 75%
Min replicas:                                          1
Max replicas:                                          25
Deployment pods:                                       16 current / 16 desired
$ kubectl top nodes
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-77120516-vmss000000   981m         51%    2212Mi          48%
aks-nodepool1-77120516-vmss000001   1297m        68%    2121Mi          46%
aks-nodepool1-77120516-vmss000002   486m         25%    883Mi           19%
aks-nodepool1-77120516-vmss000003   475m         25%    933Mi           20%
aks-nodepool1-77120516-vmss000004   507m         26%    945Mi           20%
aks-nodepool1-77120516-vmss000005   902m         47%    987Mi           21%
aks-nodepool1-77120516-vmss000006   1304m        68%    1028Mi          22%
aks-nodepool1-77120516-vmss000007   1263m        66%    1018Mi          22%
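The scale-up is easy to sanity-check with some back-of-the-envelope math. With 16 replicas each requesting 500m CPU, the Deployment alone requests 8000m, which exceeds the roughly 6000m that the original three Nodes could offer. The 2000m allocatable-per-Node figure below is illustrative; real allocatable values are lower than raw capacity due to system overhead:

```shell
replicas=16          # replicas reported by the HPA after scale-up
request_mcpu=500     # CPU request per Pod
node_mcpu=2000       # illustrative allocatable CPU per Node (2 vCPUs)
total=$((replicas * request_mcpu))
# ceil(total / node_mcpu): minimum number of Nodes needed just for these Pods
needed=$(( (total + node_mcpu - 1) / node_mcpu ))
echo "total requests: ${total}m, Nodes needed for the Deployment alone: $needed"
```

Because other workloads (including system Pods) also consume allocatable CPU, the actual Node count the CA settles on is higher than this lower bound.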
This shows how the CA has worked together with the HPA to seamlessly scale the Deployment and cluster at the same time to accommodate the workload. We will now show what automatic scaling down looks like. Perform the following steps:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elastic-hamster
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        - name: hamster
          ...
          env:
            - name: TOTAL_HAMSTER_USAGE
              value: "1.0"
...
$ kubectl apply -f ./elastic-hamster-deployment.yaml
deployment.apps/elastic-hamster configured
$ kubectl describe hpa elastic-hamster-hpa
...
Metrics:                                               ( current / target )
  resource cpu on pods (as a percentage of request):   66% (331m) / 75%
Min replicas:                                          1
Max replicas:                                          25
Deployment pods:                                       3 current / 3 desired
$ kubectl get nodes
NAME                                STATUS     ROLES   AGE   VERSION
aks-nodepool1-77120516-vmss000000   Ready      agent   22d   v1.18.14
aks-nodepool1-77120516-vmss000001   Ready      agent   22d   v1.18.14
aks-nodepool1-77120516-vmss000003   NotReady   agent   56m   v1.18.14
aks-nodepool1-77120516-vmss000004   Ready      agent   53m   v1.18.14
aks-nodepool1-77120516-vmss000005   NotReady   agent   51m   v1.18.14
aks-nodepool1-77120516-vmss000006   NotReady   agent   47m   v1.18.14
aks-nodepool1-77120516-vmss000007   NotReady   agent   42m   v1.18.14
$ kubectl get nodes
NAME                                STATUS   ROLES   AGE   VERSION
aks-nodepool1-77120516-vmss000000   Ready    agent   22d   v1.18.14
aks-nodepool1-77120516-vmss000001   Ready    agent   22d   v1.18.14
This shows how efficiently the CA can react to a decrease in cluster load once the HPA has scaled down the Deployment. Without any manual intervention, the cluster scaled up to eight Nodes for a short period of time and then scaled back down to just two. Imagine the cost difference between running an eight-Node cluster all the time and using the CA to cleverly autoscale on demand!
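To put that cost difference in perspective, here is a hypothetical monthly comparison. The price of 10 cents per Node per hour and the average autoscaled Node count are invented placeholders; real prices depend on your provider, region, and machine type:

```shell
rate_cents_per_hour=10   # hypothetical price per Node per hour, in cents
hours_per_month=730
static_nodes=8           # a fixed eight-Node cluster running all the time
avg_autoscaled_nodes=3   # assumed average Node count with the CA enabled
static_cost=$((static_nodes * rate_cents_per_hour * hours_per_month / 100))
autoscaled_cost=$((avg_autoscaled_nodes * rate_cents_per_hour * hours_per_month / 100))
echo "static: \$${static_cost}/month, autoscaled: ~\$${autoscaled_cost}/month"
```

Even with these made-up numbers, the static cluster costs more than twice as much, which is why combining the HPA with the CA is so attractive for bursty workloads.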
Tip
To avoid being charged for unwanted cloud resources, clean up the cluster or disable cluster autoscaling so that you are not running more Nodes than necessary.
This demonstration concludes our chapter about autoscaling in Kubernetes. Let's summarize what we have learned in this chapter.
In this chapter, you have learned about autoscaling techniques in Kubernetes clusters. We first explained the basics behind Pod resource requests and limits and why they are crucial for the autoscaling and scheduling of Pods. Next, we introduced the VPA, which can automatically change requests and limits for Pods based on current and past metrics. After that, you learned about the HPA, which can be used to automatically change the number of Deployment or StatefulSet replicas. The changes are done based on CPU, memory, or custom metrics. Lastly, we explained the role of the CA in cloud environments. We also demonstrated how you can efficiently combine using the HPA with the CA to achieve the scaling of your workload together with the scaling of the cluster.
There is much more that can be configured in the VPA, HPA, and CA, so we have just scratched the surface of powerful autoscaling in Kubernetes!
In the final chapter, we will explain how you can use Ingress in Kubernetes for advanced traffic routing.
For more information regarding autoscaling in Kubernetes, please refer to the following PacktPub books:
You can also refer to the official documentation: