Chapter 9. Rolling Updates, Scalability, and Quotas

In this chapter, we will explore the automated pod scalability that Kubernetes provides, how it affects rolling updates, and how it interacts with quotas. We will touch on the important topic of provisioning and how to choose and manage the size of the cluster. Finally, we will go over how the Kubernetes team tests the limits of Kubernetes with a 2,000-node cluster. Here are the main points we will cover:

  • Horizontal pod autoscaling
  • Performing rolling updates with autoscaling
  • Handling scarce resources with quotas and limits
  • Pushing the envelope with Kubernetes performance

At the end of this chapter, you will have the ability to plan a large-scale cluster, provision it economically, and make informed decisions about the various trade-offs between performance, cost, and availability. You will also understand how to set up horizontal pod autoscaling and use resource quotas intelligently to let Kubernetes automatically handle intermittent fluctuations in volume.

Horizontal pod autoscaling

Kubernetes can watch over your pods and scale them when the CPU utilization or some other metric crosses a threshold. The autoscaling resource specifies the details (percentage of CPU, how often to check) and the corresponding autoscale controller adjusts the number of replicas, if needed.

The following diagram illustrates the different players and their relationships:

Horizontal pod autoscaling

As you can see, the horizontal pod autoscaler doesn't create or destroy pods directly. It relies instead on the replication controller or deployment resources. This design is deliberate: it avoids situations where the autoscaler and a replication controller or deployment fight over the number of pods, each unaware of the other's efforts.

The autoscaler automates what we previously had to do ourselves. Without it, if we had a replication controller with replicas set to 3, but determined from average CPU utilization that we actually needed 4, we would update the replication controller from 3 to 4 and then keep monitoring CPU utilization across all pods manually. The autoscaler does this for us.
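
As a point of reference, here is a hedged sketch of that manual workflow with kubectl, assuming a replication controller named nginx (like the one declared in the next section) and a running metrics source such as Heapster for kubectl top:

# Inspect the current CPU usage of the pods (requires a metrics source such as Heapster)
> kubectl top pods

# Manually bump the replica count from 3 to 4
> kubectl scale rc nginx --replicas=4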

Declaring horizontal pod autoscaler

To declare a horizontal pod autoscaler, we need a replication controller, or a deployment, and an autoscaling resource. Here is a simple replication controller configured to maintain 3 nginx pods:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    metadata:
      labels:
        run: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
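
Note that for the autoscaler to compute a CPU utilization percentage, the containers should declare CPU requests, because utilization is measured relative to the requested CPU. The preceding example omits this; here is a hedged sketch of the container section with a request added (the 100m value is only an illustrative assumption):

      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 100m    # assumed value; utilization is measured against this request
        ports:
        - containerPort: 80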

The autoscaling resource references the Nginx replication controller in scaleTargetRef:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  maxReplicas: 4
  minReplicas: 2
  targetCPUUtilizationPercentage: 90
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: nginx

The minReplicas and maxReplicas fields specify the range of scaling. This is needed to avoid runaway situations that could occur because of some problem. Imagine that, due to some bug, every pod immediately uses 100% CPU regardless of the actual load. Without the maxReplicas limit, Kubernetes would keep creating more and more pods until all cluster resources are exhausted, and in a cloud environment with VM autoscaling we would also incur a significant cost. The other side of this problem is that, if there is no minReplicas and there is a lull in activity, all pods could be terminated, and when new requests come in, all the pods will have to be created and scheduled again. If there are patterns of on-and-off activity, this cycle can repeat multiple times. Keeping a minimum number of replicas running smooths out this phenomenon. In the preceding example, minReplicas is set to 2 and maxReplicas is set to 4, so Kubernetes will ensure that there are always between 2 and 4 nginx instances running.

The target CPU utilization percentage is a mouthful. Let's abbreviate it to TCUP. You specify a single number, but Kubernetes doesn't start scaling up and down the moment the threshold is crossed; that could lead to constant thrashing if the average load hovers around the TCUP. Instead, Kubernetes applies a tolerance, which is currently (Kubernetes 1.5) hardcoded to 0.1. That means that, if TCUP is 90%, then scaling up will occur only when average CPU utilization goes above 99% (90 + 0.1 * 90), and scaling down will occur only when average CPU utilization goes below 81% (90 - 0.1 * 90).

Custom metrics

CPU utilization is an important metric for gauging whether pods that are bombarded with too many requests should be scaled up, or whether they are mostly idle and can be scaled down. But CPU is not the only metric, and sometimes not even the best one, to track. Memory may be the limiting factor, and there are more specialized metrics, such as the depth of a pod's internal on-disk queue, the average latency of a request, or the average number of service timeouts.

Horizontal pod autoscaling on custom metrics is an alpha extension added in version 1.2. To enable it, the ENABLE_CUSTOM_METRICS environment variable must be set to true when the cluster is started. Since it's an alpha feature, custom metrics are specified as annotations in the autoscaler spec.
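
For example, with the cluster bring-up scripts that ship with Kubernetes, this can look roughly like the following; the exact mechanism depends on how you provision your cluster, so treat this as a sketch:

# Assumed usage with the bundled cluster scripts; adapt to your provisioning method
> ENABLE_CUSTOM_METRICS=true ./cluster/kube-up.sh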

Kubernetes requires that the custom metrics have a cAdvisor endpoint configured. This is a standard interface that Kubernetes understands. When you're exposing your application metrics as a cAdvisor metrics endpoint, Kubernetes can work with your metrics just like it works with its own built-in metrics. The mechanism to configure the custom metrics endpoint is to create a ConfigMap with a definition.json file that will be consumed as a volume mounted at /etc/custom-metrics.

Here is a sample ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-config
data:
  definition.json: '{"endpoint": "http://localhost:8080/metrics"}'

Since cAdvisor operates at the node level, the localhost endpoint is a node endpoint that requires the containers inside the pod to request both a host port and a container port:

    ports:
    - hostPort: 8080
      containerPort: 8080
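
The ConfigMap itself must also be mounted into the pod at /etc/custom-metrics, as mentioned earlier. Here is a minimal sketch of the relevant pod spec fragments, reusing the cm-config name from the preceding example (the container name and image are placeholders):

    volumes:
    - name: custom-metrics-definition
      configMap:
        name: cm-config
    containers:
    - name: my-app                  # placeholder container
      image: my-app:latest          # placeholder image
      volumeMounts:
      - name: custom-metrics-definition
        mountPath: /etc/custom-metrics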

The custom metrics are specified as annotations due to the alpha status of the feature. When custom metrics reach v1 status, they will be added as regular fields.

The value in the annotation is interpreted as a target metric value averaged over all running pods. For example, a queries per second (qps) custom metric can be added as follows:

    annotations:
      alpha/target.custom-metrics.podautoscaler.kubernetes.io: '{"items":[{"name":"qps", "value": "10"}]}'

At this point, the custom metrics can be handled just like the built-in CPU utilization percentage. If the average value across all pods exceeds the target value, more pods will be added, up to the maximum limit. If the average value drops below the target value, pods will be removed, down to the minimum.

When multiple metrics are present, the horizontal pod autoscaler will scale up to satisfy the most demanding one. For example, if metric A can be satisfied by three pods and metric B can be satisfied by four pods, then the pods will be scaled up to four replicas.
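
Building on the annotation format shown earlier, a sketch of a target with two custom metrics might look as follows (the queue-depth metric and both values are illustrative assumptions):

    annotations:
      alpha/target.custom-metrics.podautoscaler.kubernetes.io: '{"items":[{"name":"qps", "value": "10"}, {"name":"queue-depth", "value": "100"}]}'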

By default, the target CPU percentage is 80. Sometimes, CPU can be all over the place, and you may want to scale your pods based on some other metric. To make the CPU irrelevant for autoscaling decisions, you can set it to a ludicrous value that will never be reached, such as 999,999. Now, the autoscaler will only consider the other metrics because CPU utilization will always be below the target CPU utilization.
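
In the autoscaler spec, that trick might look like the following sketch; 999999 is just an arbitrarily high number that average CPU utilization will never reach:

spec:
  targetCPUUtilizationPercentage: 999999   # never reached, so only the custom metrics drive scaling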

Autoscaling with Kubectl

Kubectl can create an autoscaling resource using the standard create command with a configuration file. But Kubectl also has a special command, autoscale, that lets you set up an autoscaler with a single command, without a special configuration file.

  1. First, let's start a replication controller that makes sure there are three replicas of a simple pod that just runs an infinite bash loop:
    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: bash-loop-rc
    spec:
      replicas: 3
      template:
        metadata:
          labels:
            name: bash-loop-rc
        spec:
          containers:
          - name: bash-loop
            image: ubuntu
            command: ["/bin/bash", "-c", "while true; do sleep 10; done"]
    
  2. Let's create the replication controller:
    > kubectl create -f bash-loop-rc.yaml
    replicationcontroller "bash-loop-rc" created
    
  3. Here is the resulting replication controller:
    > kubectl get rc
    NAME           DESIRED   CURRENT   READY     AGE
    bash-loop-rc   3         3         3         1m
    
  4. You can see that the desired and current count are both three, meaning three pods are running. Let's make sure:
    > kubectl get pods
    NAME                 READY     STATUS    RESTARTS   AGE
    bash-loop-rc-61k87   1/1       Running   0          50s
    bash-loop-rc-7bdtz   1/1       Running   0          50s
    bash-loop-rc-smfrt   1/1       Running   0          50s
    
  5. Now, let's create an autoscaler. To make it interesting, we'll set the minimum number of replicas to 4 and the maximum number to 6:
    > kubectl autoscale rc bash-loop-rc --min=4 --max=6 --cpu-percent=50
    replicationcontroller "bash-loop-rc" autoscaled
    
  6. Here is the resulting horizontal pod autoscaler (you can use the hpa abbreviation). It shows the referenced replication controller, the target and current CPU percentage, and the min/max number of pods. The name matches the referenced replication controller:
    > kubectl get hpa
    NAME           REFERENCE    TARGET CURRENT MINPODS   MAXPODS   AGE
    bash-loop-rc   bash-loop-rc 50%    0%      4         6         7s
    
  7. Originally, the replication controller was set to have three replicas, but the autoscaler has a minimum of four pods. What's the effect on the replication controller? That's right. Now the desired number of replicas is four. If the average CPU utilization goes above 50%, then it may climb to five or even six:
    > kubectl get rc
    NAME           DESIRED   CURRENT   READY     AGE
    bash-loop-rc   4         4         4         7m
    
  8. Just to make sure everything works, here is another look at the pods. Note the new pod (58 seconds old) that was created because of the autoscaling:
    > kubectl get pods
    NAME                 READY     STATUS    RESTARTS   AGE
    bash-loop-rc-61k87   1/1       Running   0          8m
    bash-loop-rc-7bdtz   1/1       Running   0          8m
    bash-loop-rc-smfrt   1/1       Running   0          8m
    bash-loop-rc-z0xrl   1/1       Running   0          58s
    
  9. When we delete the horizontal pod autoscaler, the replication controller retains the last desired number of replicas (four in this case). Nobody remembers that the replication controller was created with three replicas:
    > kubectl delete hpa bash-loop-rc
    horizontalpodautoscaler "bash-loop-rc" deleted
    
  10. As you can see, the replication controller wasn't reset and still maintains four pods even when the autoscaler is gone:
    > kubectl get rc
    NAME           DESIRED   CURRENT   READY     AGE
    bash-loop-rc   4         4         4         9m
    

Let's try something else. What happens if we create a new horizontal pod autoscaler with a range of 2 to 6 and the same CPU target of 50%?

> kubectl autoscale rc bash-loop-rc --min=2 --max=6 --cpu-percent=50
replicationcontroller "bash-loop-rc" autoscaled

Well, the replication controller still maintains its four replicas, which is within the range:

> kubectl get rc
NAME           DESIRED   CURRENT   READY     AGE
bash-loop-rc   4         4         4         9m

However, the actual CPU utilization is zero, or close to zero. The replica count should have been scaled down to two replicas. Let's check out the horizontal pod autoscaler itself:

> kubectl get hpa
NAME           REFERENCE    TARGET CURRENT   MINPODS MAXPODS   AGE
bash-loop-rc   bash-loop-rc 50%    <waiting> 2       6         1m

The secret is in the current CPU metric, which is <waiting>. That means that the autoscaler hasn't received up-to-date information from Heapster yet, so it has no reason to scale the number of replicas in the replication controller.
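
If the current value stays at <waiting> for a while, a hedged next step is to inspect the autoscaler's events and conditions, which usually reveal whether metrics are being collected (output omitted here):

> kubectl describe hpa bash-loop-rc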
