In this chapter, we will explore the automated pod scalability that Kubernetes provides, how it affects rolling updates, and how it interacts with quotas. We will touch on the important topic of provisioning and how to choose and manage the size of the cluster. Finally, we will go over how the Kubernetes team tests the limits of Kubernetes with a 2,000-node cluster.
By the end of this chapter, you will be able to plan a large-scale cluster, provision it economically, and make informed decisions about the various trade-offs between performance, cost, and availability. You will also understand how to set up horizontal pod autoscaling and use resource quotas intelligently to let Kubernetes automatically handle intermittent fluctuations in volume.
Kubernetes can watch over your pods and scale them when the CPU utilization or some other metric crosses a threshold. The autoscaling resource specifies the details (percentage of CPU, how often to check) and the corresponding autoscale controller adjusts the number of replicas, if needed.
The following diagram illustrates the different players and their relationships:
As you can see, the horizontal pod autoscaler doesn't create or destroy pods directly. Instead, it relies on the replication controller or deployment resources. This is a smart design because it avoids situations where autoscaling conflicts with a replication controller or deployment trying to scale the number of pods, unaware of the autoscaler's efforts.
The autoscaler automatically does what we had to do ourselves before. Without the autoscaler, if we had a replication controller with replicas set to 3, but we determined that based on average CPU utilization we actually needed 4, then we would update the replication controller from 3 to 4 and keep monitoring the CPU utilization manually in all pods. The autoscaler will do it for us.
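For reference, the manual resize is a single imperative command (shown here against an nginx replication controller like the one defined next):

> kubectl scale rc nginx --replicas=4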
To declare a horizontal pod autoscaler, we need a replication controller, or a deployment, and an autoscaling resource. Here is a simple replication controller configured to maintain 3 nginx pods:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    metadata:
      labels:
        run: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
The autoscaling resource references the nginx replication controller in scaleTargetRef:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  maxReplicas: 4
  minReplicas: 2
  targetCPUUtilizationPercentage: 90
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: nginx
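With both definitions saved to files, we can create the two resources with the standard create command (the file names here are illustrative):

> kubectl create -f nginx-rc.yaml
> kubectl create -f nginx-hpa.yaml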
The minReplicas and maxReplicas fields specify the range of scaling. This is needed to avoid runaway situations that could occur because of some problem. Imagine that, due to some bug, every pod immediately uses 100% CPU regardless of the actual load. Without the maxReplicas limit, Kubernetes will keep creating more and more pods until all cluster resources are exhausted. If we are running in a cloud environment with autoscaling of VMs, then we will incur a significant cost. The other side of this problem is that, if there is no minReplicas and there is a lull in activity, then all pods could be terminated, and when new requests come in, all the pods will have to be created and scheduled again. If there are patterns of on-and-off activity, then this cycle can repeat multiple times. Keeping a minimum of replicas running can smooth this phenomenon. In the preceding example, minReplicas is set to 2 and maxReplicas is set to 4. Kubernetes will ensure that there are always between 2 and 4 nginx instances running.
The target CPU utilization percentage is a mouthful. Let's abbreviate it to TCUP. You specify a single number, but Kubernetes doesn't start scaling up and down immediately when the threshold is crossed. This could lead to constant thrashing if the average load hovers around the TCUP. Instead, Kubernetes has a tolerance, which is currently (Kubernetes 1.5) hardcoded to 0.1. That means that, if TCUP is 90%, then scaling up will occur only when average CPU utilization goes above 99% (90 + 0.1 * 90) and scaling down will occur only if average CPU utilization goes below 81% (90 - 0.1 * 90).
CPU utilization is an important metric for gauging whether pods that are bombarded with too many requests should be scaled up, or whether they are mostly idle and can be scaled down. But CPU is not the only metric, and sometimes not even the best one, to keep track of. Memory may be the limiting factor, or even more specialized metrics, such as the depth of a pod's internal on-disk queue, the average latency of a request, or the average number of service timeouts.
The horizontal pod custom metrics are an alpha extension added in version 1.2. The ENABLE_CUSTOM_METRICS environment variable must be set to true when the cluster is started to enable custom metrics. Since it's an alpha feature, it is specified as annotations in the autoscaler spec.
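For example, if you bring up your cluster with the kube-up.sh scripts, you can export the variable before running them (a sketch; the exact mechanism depends on how your cluster is provisioned):

> export ENABLE_CUSTOM_METRICS=true
> cluster/kube-up.sh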
Kubernetes requires that the custom metrics have a cAdvisor endpoint configured. This is a standard interface that Kubernetes understands. When you're exposing your application metrics as a cAdvisor metrics endpoint, Kubernetes can work with your metrics just like it works with its own built-in metrics. The mechanism to configure the custom metrics endpoint is to create a ConfigMap with a definition.json file that will be consumed as a volume mounted at /etc/custom-metrics.
Here is a sample ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-config
data:
  definition.json: '{"endpoint": "http://localhost:8080/metrics"}'
Since cAdvisor operates at the node level, the localhost endpoint is a node endpoint that requires the containers inside the pod to request both a host port and a container port:
ports:
- hostPort: 8080
  containerPort: 8080
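Putting the pieces together, the pod spec might mount the ConfigMap and request the ports like this (a minimal sketch; the container name, image, and volume name are illustrative):

spec:
  containers:
  - name: my-app
    image: my-app:latest
    ports:
    - hostPort: 8080
      containerPort: 8080
    volumeMounts:
    - name: cm-config-volume
      mountPath: /etc/custom-metrics
  volumes:
  - name: cm-config-volume
    configMap:
      name: cm-config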
The custom metrics are specified as annotations due to the alpha status of the feature. When custom metrics reach v1 status, they will be added as regular fields.
The value in the annotation is interpreted as a target metric value averaged over all running pods. For example, a queries per second (qps) custom metric can be added as follows:
annotations:
  alpha/target.custom-metrics.podautoscaler.kubernetes.io: '{"items":[{"name":"qps", "value": "10"}]}'
At this point, the custom metrics can be handled just like the built-in CPU utilization percentage. If the average value across all pods exceeds the target value, then more pods will be added up to the max limit. If the average value drops below the target value, then pods will be destroyed up to the minimum.
When multiple metrics are present, the horizontal pod autoscaler will scale up to satisfy the most demanding one. For example, if metric A can be satisfied by three pods and metric B can be satisfied by four pods, then the pods will be scaled up to four replicas.
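For instance, using the same alpha annotation format, two custom metrics could be targeted together (the queue-depth metric name here is hypothetical):

annotations:
  alpha/target.custom-metrics.podautoscaler.kubernetes.io: '{"items":[{"name":"qps", "value": "10"},{"name":"queue-depth", "value": "100"}]}'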
By default, the target CPU percentage is 80. Sometimes, CPU can be all over the place, and you may want to scale your pods based on some other metric. To make the CPU irrelevant for autoscaling decisions, you can set it to a ludicrous value that will never be reached, such as 999,999. Now, the autoscaler will only consider the other metrics because CPU utilization will always be below the target CPU utilization.
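For example, reusing the earlier autoscaler spec, only the CPU target changes:

spec:
  maxReplicas: 4
  minReplicas: 2
  targetCPUUtilizationPercentage: 999999
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: nginx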
Kubectl can create an autoscale resource using the standard create command accepting a configuration file. But kubectl also has a special command, autoscale, that lets you easily set up an autoscaler in one command, without a special configuration file.
Let's see kubectl autoscale in action. Here is a replication controller whose pods just run an infinite bash loop:

apiVersion: v1
kind: ReplicationController
metadata:
  name: bash-loop-rc
spec:
  replicas: 3
  template:
    metadata:
      labels:
        name: bash-loop-rc
    spec:
      containers:
      - name: bash-loop
        image: ubuntu
        command: ["/bin/bash", "-c", "while true; do sleep 10; done"]
> kubectl create -f bash-loop-rc.yaml
replicationcontroller "bash-loop-rc" created
> kubectl get rc
NAME           DESIRED   CURRENT   READY   AGE
bash-loop-rc   3         3         3       1m
> kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
bash-loop-rc-61k87   1/1     Running   0          50s
bash-loop-rc-7bdtz   1/1     Running   0          50s
bash-loop-rc-smfrt   1/1     Running   0          50s
Now, let's create a horizontal pod autoscaler, setting the minimum number of replicas to 4 and the maximum number to 6:

> kubectl autoscale rc bash-loop-rc --min=4 --max=6 --cpu-percent=50
replicationcontroller "bash-loop-rc" autoscaled
Here is the resulting horizontal pod autoscaler (you can use the shorthand hpa). It shows the referenced replication controller, the target and current CPU percentage, and the min/max pods. The name matches the referenced replication controller:

> kubectl get hpa
NAME           REFERENCE      TARGET   CURRENT   MINPODS   MAXPODS   AGE
bash-loop-rc   bash-loop-rc   50%      0%        4         6         7s
The replication controller itself was resized to four replicas, matching the autoscaler's minimum:

> kubectl get rc
NAME           DESIRED   CURRENT   READY   AGE
bash-loop-rc   4         4         4       7m
Note the new pod (only 58 seconds old) that was created because of the autoscaling:

> kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
bash-loop-rc-61k87   1/1     Running   0          8m
bash-loop-rc-7bdtz   1/1     Running   0          8m
bash-loop-rc-smfrt   1/1     Running   0          8m
bash-loop-rc-z0xrl   1/1     Running   0          58s
Now, let's delete the horizontal pod autoscaler:

> kubectl delete hpa bash-loop-rc
horizontalpodautoscaler "bash-loop-rc" deleted
The replication controller keeps the last desired replica count (four, in this case) even though the autoscaler is gone:

> kubectl get rc
NAME           DESIRED   CURRENT   READY   AGE
bash-loop-rc   4         4         4       9m
Let's try something else. What happens if we create a new horizontal pod autoscaler with a range of 2 to 6 and the same CPU target of 50%?
> kubectl autoscale rc bash-loop-rc --min=2 --max=6 --cpu-percent=50
replicationcontroller "bash-loop-rc" autoscaled
Well, the replication controller still maintains its four replicas, which is within the range:
> kubectl get rc
NAME           DESIRED   CURRENT   READY   AGE
bash-loop-rc   4         4         4       9m
However, the actual CPU utilization is zero, or close to zero, so the replica count should have been scaled down to two. Let's check out the horizontal pod autoscaler itself:
> kubectl get hpa
NAME           REFERENCE      TARGET   CURRENT     MINPODS   MAXPODS   AGE
bash-loop-rc   bash-loop-rc   50%      <waiting>   2         6         1m
The secret is in the current CPU metric, which is <waiting>. That means that the autoscaler hasn't received up-to-date information from Heapster yet, so it has no reason to scale the number of replicas in the replication controller.
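If the current value stays at <waiting> for long, it is worth verifying that Heapster is actually running. One way to check (assuming Heapster is deployed in the standard kube-system namespace) is:

> kubectl get pods --namespace=kube-system | grep heapster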